{"title": "Automatic Annotation of Everyday Movements", "book": "Advances in Neural Information Processing Systems", "page_first": 1547, "page_last": 1554, "abstract": "", "full_text": "Automatic Annotation of Everyday Movements\n\nDeva Ramanan and D. A. Forsyth\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nramanan@cs.berkeley.edu, daf@cs.berkeley.edu\n\nAbstract\n\nThis paper describes a system that can annotate a video sequence with:\na description of the appearance of each actor; when the actor is in view;\nand a representation of the actor\u2019s activity while in view. The system does\nnot require a \ufb01xed background, and is automatic. The system works by\n(1) tracking people in 2D and then, using an annotated motion capture\ndataset, (2) synthesizing an annotated 3D motion sequence matching the\n2D tracks. The 3D motion capture data is manually annotated off-line\nusing a class structure that describes everyday motions and allows mo-\ntion annotations to be composed \u2014 one may jump while running, for\nexample. Descriptions computed from video of real motions show that\nthe method is accurate.\n\n1. Introduction\nIt would be useful to have a system that could take large volumes of video data of people\nengaged in everyday activities and produce annotations of that data with statements about\nthe activities of the actors. Applications demand that an annotation system: is wholly auto-\nmatic; can operate largely independent of assumptions about the background or the number\nof actors; can describe a wide range of everyday movements; does not fail catastrophically\nwhen it encounters an unfamiliar motion; and allows easy revision of the motion descrip-\ntions that it uses. We describe a system that largely has these properties. We track multiple\n\ufb01gures in video data automatically. We then synthesize 3D motion sequences matching\nour 2D tracks using a collection of annotated motion capture data, and then apply the an-\nnotations of the synthesized sequence to the video.\nPrevious work is extensive, as classifying human motions from some input is a matter of\nobvious importance. Space does not allow a full review of the literature; see [1, 5, 4, 9, 13].\nBecause people do not change in appearance from frame to frame, a practical strategy is to\ncluster an appearance model for each possible person over the sequence, and then use these\nmodels to drive detection. This yields a tracker that is capable of meeting all our criteria,\ndescribed in greater detail in [14]; we used the tracker of that paper. Leventon and Freeman\nshow that tracks can be signi\ufb01cantly improved by comparison with human motion [12].\nDescribing motion is subtle, because we require a set of categories into which the motion\ncan be classi\ufb01ed; except in the case of speci\ufb01c activities, there is no known natural set of\ncategories. Special cases include ballet and aerobic moves, which have a clearly established\ncategorical structure [5, 6]. In our opinion, it is dif\ufb01cult to establish a canonical set of\nhuman motion categories, and more practical to produce a system that allows easy revision\nof the categories (section 2).\n\nFigure 1 shows an overview of our approach to activity recognition. We use 3 core compo-\nnents; annotation, tracking, and motion synthesis. Initially, a user labels a collection of 3D\nmotion capture frames with annotations (section 2). 
Given a new video sequence to annotate, we use a kinematic tracker to obtain 2D tracks of each figure in the sequence (section 3).

Figure 1: Our annotation system consists of 3 main components: annotation, tracking, and motion synthesis (the shaded nodes). A user initially labels a collection of 3D motion capture frames with annotations. Given a new video sequence to annotate, we use a kinematic tracker to obtain 2D tracks of each figure in the sequence. We then synthesize 3D motion sequences which look like the 2D tracks by lifting tracks to 3D and matching them to our annotated motion capture library. We accept the annotations associated with the synthesized 3D motion sequence as annotations for the underlying video sequence.

We then synthesize 3D motion sequences which look like the 2D tracks by lifting tracks to 3D and matching them to our annotated motion capture library (section 4). We finally smooth the annotations associated with the synthesized 3D motion sequence (section 5), accepting them as annotations for the underlying video sequence.

2. Obtaining Annotated Data

We have annotated a body of motion data with an annotation system, described in detail in [3]; we repeat some information here for the convenience of the reader.

There is no reason to believe that a canonical annotation vocabulary is available for everyday motion, meaning that the system of annotation should be flexible. Annotations should allow for composition, as one can wave while walking, for example. We achieve this by representing each separate term in the vocabulary as a bit in a bit string. Our annotation system attaches a bit string to each frame of motion. Each bit in the string represents annotation with a particular element of the vocabulary, meaning that elements of the vocabulary can be composed arbitrarily.

Actual annotation is simplified by using an approach where the user bootstraps a classifier. One SVM classifier is learned for each element of the vocabulary. The user annotates a series of example frames by hand by selecting a sequence from the motion collection; a classifier is then learned from these examples, and the user reviews the resulting annotations. If they are not acceptable, the user revises the annotations at will, and then re-learns a classifier. Each classifier is learned independently. The classifier itself uses a radial basis function kernel, and uses the joint positions for one second of motion centered at the frame being classified as a feature vector. Since the motion is sampled in time, each joint has a discrete 3D trajectory in space for the second of motion centered at the frame. In our implementation, we used a public domain SVM library (libsvm [7]). The out-of-margin cost for the SVM is kept high to force a good fit within the capabilities of the basis function approximation. (A rough sketch of this per-term training loop appears below.)

Our reference collection consists of a total of 7 minutes of motion capture data. The vocabulary that we chose to annotate this database consisted of: run, walk, wave, jump, turn left, turn right, catch, reach, carry, backwards, crouch, stand, and pick up. Some of these annotations co-occur: turn left while walking, or catch while jumping and running.
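As promised above, here is a rough illustration of the bootstrapped, per-term labelling: one independent RBF-kernel classifier per vocabulary term, trained on one-second windows of joint positions, with the resulting bits composed into an annotation string. It is a minimal sketch, not the authors' implementation (which used libsvm): scikit-learn's SVC stands in for libsvm, and the frame rate, window length, array shapes, and the cost value are assumptions.

import numpy as np
from sklearn.svm import SVC

FPS = 30                    # assumed motion-capture frame rate
WINDOW = FPS                # one second of motion centred on the frame
VOCAB = ["run", "walk", "wave", "jump", "turn left", "turn right", "catch",
         "reach", "carry", "backwards", "crouch", "stand", "pick up"]

def window_features(joints, frame):
    """Stack joint positions for one second of motion centred at `frame`.
    `joints` is (num_frames, num_joints, 3); the sequence ends are clamped."""
    half = WINDOW // 2
    idx = np.clip(np.arange(frame - half, frame + half), 0, len(joints) - 1)
    return joints[idx].ravel()

def train_term_classifiers(joints, labels):
    """One independent SVM per vocabulary term.
    `labels` is (num_frames, len(VOCAB)) of 0/1 annotation bits; each term is
    assumed to have both positive and negative hand-labelled examples."""
    feats = np.stack([window_features(joints, f) for f in range(len(joints))])
    classifiers = {}
    for b, term in enumerate(VOCAB):
        # a high out-of-margin cost C forces a tight fit to the hand labels
        clf = SVC(kernel="rbf", C=100.0)
        clf.fit(feats, labels[:, b])
        classifiers[term] = clf
    return classifiers

def annotate_frame(classifiers, joints, frame):
    """Compose the per-term decisions into a single annotation bit string."""
    x = window_features(joints, frame)[None, :]
    return {term: int(clf.predict(x)[0]) for term, clf in classifiers.items()}

Because each bit is predicted independently, any combination of terms can in principle be produced, which is exactly the composability the bit-string representation is meant to allow.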
Our approach admits any combination of annotations, though some combinations may not be used in practice: for example, we can't conceive of a motion that should be annotated with both stand and run. A different choice of vocabulary would be appropriate for different collections. The annotations are not required to be canonical. We have verified that a consistent set of annotations to describe a motion set can be picked by asking people outside our research group to annotate the same database and comparing annotation results.

3. Kinematic Tracking

We use the tracker of [14], which is described in greater detail in that paper. We repeat some information here for the convenience of the reader. The tracker works by building an appearance model of putative actors, detecting instances of that model, and linking the instances across time.

The appearance model approximates a view of the body as a puppet built of colored, textured rectangles. The model is built by applying detuned body segment detectors to some or all frames in a sequence. These detectors respond to roughly parallel contrast energies at a set of fixed scales (one for the torso and one for other segments). A detector response at a given position and orientation suggests that there may be a rectangle there. For the frames that are used to build the model, we cluster together segments that are sufficiently close in appearance — as encoded by a patch of pixels within the segment — and appear in multiple frames without violating upper bounds on velocity. Clusters that contain segments that do not move at any point of the sequence are then rejected. The next step is to build assemblies of segments that lie together like a body puppet. The torso is used as a root, because our torso detector is quite reliable. One then looks for segments that lie close to the torso in multiple frames to form arm and leg segments. This procedure does not require a reliable initial segment detector, because we are using many frames to build a model — if a segment is missed in a few frames, it can be found in others. We are currently assuming that each individual is differently dressed, so that the number of individuals is the number of distinct appearance models. Detecting the learned appearance model in the sequence of frames is straightforward [8].

4. 3D Motion Synthesis

Once the 2D configuration of actors has been identified, we need to synthesize a sequence of 3D configurations matching the 2D reports. Maintaining a degree of smoothness — i.e., ensuring that not only is a 3D representation a good match to the 2D configuration, but also links well to the previous and future 3D representations — is needed because the image detection is not perfect. We assume that camera motion can be recovered from a video sequence, and so we need only to recover the pose of the root of the body model — in our case, the torso — with respect to the camera.

Representing Body Configuration: We assume the camera is orthographic and is oriented with the y axis perpendicular to the ground plane, by far the most important case. From the puppet we can compute 2D positions for various key points on the body (we use the left-right shoulder, elbow, wrist, knee, ankle and the upper and lower torso). We represent the 2D key points with respect to a 2D torso coordinate frame.
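A minimal sketch of such a torso-centered representation follows. The conventions here (the frame axes and scale come from the upper- and lower-torso keypoints, and keypoints are stored in a name-to-position dictionary) are our own assumptions for illustration, not the paper's exact construction.

import numpy as np

def torso_frame_2d(keypoints):
    """Express 2D keypoints relative to a 2D torso coordinate frame.

    `keypoints` maps names ('upper_torso', 'lower_torso', 'l_wrist', ...)
    to (x, y) image positions. The torso axis defines the frame's y
    direction and its length sets a crude per-figure scale.
    """
    up = np.asarray(keypoints["upper_torso"], dtype=float)
    low = np.asarray(keypoints["lower_torso"], dtype=float)
    origin = 0.5 * (up + low)                 # torso centre
    axis = up - low
    scale = np.linalg.norm(axis)
    y_dir = axis / scale
    x_dir = np.array([y_dir[1], -y_dir[0]])   # perpendicular to the torso axis
    R = np.stack([x_dir, y_dir])              # rows are the torso-frame axes
    return {name: R @ ((np.asarray(p, dtype=float) - origin) / scale)
            for name, p in keypoints.items()}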
We analogously convert the motion capture data to 3D key points represented with respect to the 3D torso coordinate frame.

We assume that all people are within an isotropic scaling of one another. This means that the scaling of the body can be folded in with the camera scale, and the overall scale can be estimated using corresponding limb lengths in lateral views (which can be identified because they maximize the limb lengths). This strategy would probably lead to difficulties if, for example, the motion capture data came from an individual with a short torso and long arms; the tendency of ratios of body segment lengths to vary from individual to individual and with age is a known, but not well understood, source of trouble in studies of human motion [10].

Our motion capture database is too large for us to use every frame in the matching process. Furthermore, many motion fragments are similar — there is an awful lot of running — so we vector quantize the 11,000 frames down to k = 300 frames by clustering with k-means and retaining only the cluster medoids. Our distance metric is a weighted sum of differences between 3D key point positions, velocities, and accelerations ([2] found this metric sufficient to ensure smooth motion synthesis). The motion capture data are represented at the same frame rate as the video, to ensure consistent velocity estimates. (A short sketch of this quantization step appears below.)

Figure 2: In (a), the variables under discussion in camera inference. M is a representation of the figure in 3D with respect to its root coordinate frame, m is the partially observed vector of 2D key points, t is the known camera position and T is the position of the root of the 3D figure. In (b), a camera model for frame i where 2D keypoints are dependent on the camera position, 3D figure configuration, and the root of the 3D figure. A simplified undirected model in (c) is obtained by marginalizing out the observed variables, yielding a single potential on M_i and T_i. In (d), the factorial hidden Markov model obtained by extending the undirected model across time. As we show in the text, it is unwise to yield to the temptation to cut links between T's (or M's) to obtain a simplified model. However, our FHMM is tractable, and yields the triangulated model in (e).

Modeling Root Configuration: Figure 2 illustrates our variables. For a given frame, we have unknowns M, a vector of 3D key points, and T, the 3D global root position. Known are m, the (partially) observed vector of 2D key points, and t, the known camera position. In practice, we do not need to model the translations for the 3D root (which is the torso); our tracker reports the (x, y) image position for the torso, and we simply accept these reports. This means that T reduces to a single scalar representing the orientation of the torso along the ground plane.
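The vector quantization referred to above can be sketched as follows. This is a rough illustration under assumptions the paper does not spell out (how the per-frame feature vectors are formed and how the position, velocity, and acceleration terms are weighted): k-means runs in a weighted feature space and each cluster is then represented by the member frame closest to its centroid, so every codeword is a real motion-capture frame.

import numpy as np
from sklearn.cluster import KMeans

def frame_features(keypoints_3d, w_pos=1.0, w_vel=1.0, w_acc=1.0):
    """Weighted position/velocity/acceleration features per mocap frame.
    `keypoints_3d` is (num_frames, num_joints, 3), sampled at the video frame rate."""
    pos = keypoints_3d
    vel = np.gradient(pos, axis=0)            # finite-difference velocities
    acc = np.gradient(vel, axis=0)
    feats = np.concatenate([w_pos * pos, w_vel * vel, w_acc * acc], axis=2)
    return feats.reshape(len(pos), -1)

def quantize_to_codewords(keypoints_3d, k=300, seed=0):
    """Cluster frames with k-means, then keep the frame closest to each
    centroid as the cluster's representative codeword."""
    feats = frame_features(keypoints_3d)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(feats)
    representatives = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        representatives.append(members[np.argmin(dists)])
    return np.array(representatives), km.labels_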
Returning to the root configuration: the relative out-of-image-plane movement of the torso (in the z direction) can be recovered from the final inferred M and T values by integration — one sums the out-of-plane velocities of the rotated motion capture frames.

Figure 2 shows the directed graphical model linking these variables for a single frame. This model can be converted to an undirected model — also shown in the figure — where the observed 2D key points specify a potential between M_i and T_i. Write the potential for the ith frame as ψ_view,i(M_i, T_i). We wish to minimize image error, so it is natural to use backprojection error for the potential. This means that ψ_view,i(M_i, T_i) is the mean squared error between the visible 2D key points m_i and the corresponding 3D keypoints M_i rendered at orientation T_i. To handle left-right ambiguities, we take the minimum error over all left-right assignments. To incorporate higher-order dynamic information such as velocities and accelerations, we add keypoints from the two preceding and two following frames when computing the mean squared error.

We quantize the torso orientation T_i into a total of c = 20 values. This means that the potential ψ_view,i(M_i, T_i) is represented by a c × k table (recall that k is the total number of motion capture medoids used, section 4).

We must also define a potential linking body configurations in time, representing the continuity cost of placing one motion after another. We write this potential as ψ_link(M_i, M_{i+1}). This is a k × k table, and we set the (i, j)'th entry of this table to be the distance between the j'th medoid and the frame following the i'th medoid, using the metric used for vector quantizing the motion capture dataset (section 4).

Inferring Root Configuration: The model of figure 2-(d) is known as a factorial hidden Markov model (FHMM) where observations have been marginalized out, and is quite tractable. Exact inference requires triangulating the graph (figure 2-(e)) to make explicit additional probabilistic dependencies [11]. The maximum clique size is now 3, making inference O(k^2 c N) (where N is the total number of frames). Furthermore, the triangulation allows us to explicitly define the potential ψ_torso(M_i, T_i, T_{i+1}) to capture the dependency
of torso angular velocity on the given motion. For example, we expect the torso angular velocity of a turning motion frame to be different from a walking-forward frame. We set a given entry of this table to be the squared error between the sampled angular velocity (T_{i+1} − T_i, shifted to lie between −π and π) and the actual torso angular velocity of the medoid M_i.

Figure 3: Unfamiliar configurations can either be annotated with 'null' or with the closest match. We show smoothed annotation results for a sequence of jumping jacks (sometimes known as star jumps) from two such annotation systems. In the top row, we show the same two frames run through each system. The MAP reconstruction of the human figure obtained from the tracking data has been reprojected back to the image, using the MAP estimate of camera configuration. In the bottom, we show signals representing annotation bits over time. The manual annotator records whether or not the figure is present, front facing, in a closed stance, and/or in an extended stance. The automatic annotation consists of a total of 16 bits: present, front facing, plus the 13 bits from the annotation vocabulary of Sec. 2. At the first dotted line, corresponding to the image above it, the manual annotator asserts the figure is present, frontally facing, and about to reach the extended stance. The automatic annotator asserts the figure is present, frontally facing, and walking and waving, and is not standing, not jumping, etc. The annotations for both systems are reasonable given there are no corresponding categories available (this is like describing a movement that is totally unfamiliar). On the left, we freely allow 'null' annotations (where no annotation bit is set). On the right, we discourage 'null' annotations as described in Sec. 6. Configurations near the closed stance are now labeled as standing, a reasonable approximation.

We scale the ψ_view,i(M_i, T_i), ψ_link(M_i, M_{i+1}), and ψ_torso(M_i, T_i, T_{i+1}) potentials by empirically determined values to yield satisfactory results. These scale factors weight the degree to which the final 3D track should be continuous versus the degree to which it should match the 2D data. In principle, these weights could be set optimally by a detailed study of the properties of our tracker, but we have found it simpler to set them by experiment.

We find the maximum a posteriori (MAP) estimate of M_i and T_i by a variant of dynamic programming defined for clique trees [11]. Since we implicitly used negative log likelihoods to define the potentials (the squared error terms), we used the min-sum variant of the max-product algorithm.

Possible Variants: One might choose to not enforce consistency in the root orientation T_i between frames. By breaking the links between the T_i variables in figure 2-(a), we could
reduce our model to a tree and make inference even simpler — we now have an HMM. However, this is simplicity at the cost of wasting an important constraint — the camera does not flip around the body from frame to frame. This constraint is useful, because our current image representation provides very little information about the direction of movement in some cases. In particular, in a lateral view of a figure in the stance phase of walking it is very difficult to tell which way the actor is facing without reference to other frames — where it may not be ambiguous. We have found that if one does break these links, the reconstruction regularly flips direction around such frames.

Figure 4: We show annotation results for a walking sequence from three versions of our system using the notation of Fig. 3. Null matches are allowed. On the left, we infer the 3D configuration M_i (and associated annotation) independently for each frame, as discussed in Sec. 4. In the center, we model temporal dependencies when inferring M_i and its corresponding annotation. On the right, we smooth the annotations, as discussed in Sec. 5. Each image is labeled with an arrow pointing in the direction the inferred figure is facing, not moving. By modeling camera dependencies, we are able to fix incorrect torso orientations present in the left system (i.e., the first image frame and the automatic left facing and right facing annotation bits). By smoothing the annotations, we eliminate spurious stand's present in the center. Although the smoothing system correctly annotates the last image frame with backward, the occluded arm incorrectly triggers a wave, by the mechanism described in Sec. 5.

5. Reporting Annotations

We now have MAP estimates of the 3D configuration {M̂_i} and orientation {T̂_i} of the body for each frame. The simplest method for reporting annotations is to produce an annotation that is some function of {M̂_i}. Recall that each M̂_i is one of the medoids produced by our clustering process (section 4). It represents a cluster of frames, all of which are similar. We could now report either the annotation of the medoid, the annotation that appears most frequently in the cluster, the annotation of the cluster element that matches the image best, or the frequency of annotations across the cluster.

Figure 5: Smoothed annotations of 3 figures from a video sequence of the three passing a ball back and forth, using the conventions of figure 3. Null matches are allowed. The dashed vertical lines indicate annotations corresponding to the frames shown. The automatic annotations are largely accurate: the figures are correctly identified, and the directions in which the figures are facing are largely correct. There is some confusion between run and walk, and throws appear to be identified as waves and reaches. Generally, when the figure has the ball (after catching and before throwing, as denoted in the manual annotations), he is annotated as carrying, though there is some false detection. There are no spurious crouches, turns, etc.

The fourth alternative produces results that may be useful for some kinds of decision-making, but are very difficult to interpret directly — each frame generates a posterior probability over the annotation vocabulary — and we do not discuss it further here. Each of the first three tends to produce choppy annotation streams (figure 4, center). This is because we have vector quantized the motion capture frames, meaning that ψ_link(M_i, M_{i+1}) is a fairly rough approximation of a smoothness constraint (because some frames in one cluster might link well to some frames in another and badly to others in that same cluster).
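For reference, the MAP estimates {M̂_i} and {T̂_i} that these reports are functions of can be computed with a standard min-sum (Viterbi-style) recursion. The sketch below is a simplified version of the inference described in Section 4: it decodes over the joint state (M_i, T_i), which finds the same MAP assignment as the triangulated clique-tree recursion but costs a factor of c more, and it assumes the three potential tables are supplied as dense cost arrays.

import numpy as np

def map_medoids_and_orientations(psi_view, psi_link, psi_torso):
    """Min-sum MAP decoding over the joint state s = (medoid M, torso bin T).

    psi_view : (N, k, c)  per-frame backprojection costs
    psi_link : (k, k)     motion-continuity cost between consecutive medoids
    psi_torso: (k, c, c)  cost of the torso rotation T_i -> T_{i+1} given medoid M_i
    Returns the MAP sequences of medoid indices and orientation bins.
    Note: the dense (k*c) x (k*c) pairwise table is the price of this
    simplification; the paper's triangulated clique tree avoids it.
    """
    N, k, c = psi_view.shape
    pair = (psi_link[:, None, :, None]        # (k, 1, k, 1): link(M_i, M_{i+1})
            + psi_torso[:, :, None, :]        # (k, c, 1, c): torso(M_i, T_i, T_{i+1})
            ).reshape(k * c, k * c)
    cost = psi_view[0].reshape(-1).copy()     # running min-cost to each state
    back = np.zeros((N, k * c), dtype=int)
    for i in range(1, N):
        total = cost[:, None] + pair          # rows: previous state, cols: next state
        back[i] = np.argmin(total, axis=0)
        cost = total[back[i], np.arange(k * c)] + psi_view[i].reshape(-1)
    states = np.empty(N, dtype=int)           # trace back the best path
    states[-1] = int(np.argmin(cost))
    for i in range(N - 1, 0, -1):
        states[i - 1] = back[i, states[i]]
    return states // c, states % c            # medoid index, orientation bin

Whichever of the simple reporting rules is applied to the decoded medoid sequence, the vector quantization leaves the resulting annotation stream choppy.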
An alternative is to smooth the annotation stream.

Smoothing Annotations: Recall that we have 13 terms in our annotation vocabulary, each of which can be on or off for any given frame. Of the 2^13 possible bit strings, we observe a total of 32 in our set of motions. Clearly, we cannot smooth annotation bits directly, because we might very likely create bit strings that never occur. Instead, we regard each observed annotation string as a codeword.

We can model the temporal dynamics of codewords and their quantized observations using a standard HMM. The hidden state is the codeword, taking on one of l (= 32) values, while the observed state is the cluster, taking on one of k (= 300) values. This model is defined by an l × l matrix representing codeword dynamics and an l × k matrix representing the quantized observation. Note that this model is fully observed in the 11,000 frames of the motion database; we know the true codeword for each motion frame and the cluster to which the frame belongs. Hence we can learn both matrices through straightforward multinomial estimation. We now apply this model to the MAP estimate of {M̂_i}, inferring a sequence of annotation codewords (which we can later expand back into annotation bit vectors).

Occlusion: When a limb is not detected by the tracker, the configuration of that limb is not scored in evaluating the potential. In turn, this means that the best configuration consistent with all else detected is used, in this case with the figure waving (figure 4). In an ideal closed world, we can assume the limb is missing because it's not there; in practice, it may be due to a detector failure. This makes employing "negative evidence" difficult.

6. Experimental Results

It is difficult to evaluate results simply by recording detection information (say, an ROC curve for events). Furthermore, there is no meaningful standard against which one can compare. Instead, we lay out a comparison between human and automatic annotations, as in Fig. 3, which shows annotation results for a 91 frame jumping jack (or star jump) sequence. The top 4 lower-case annotations are hand-labeled over the entire 91 frame sequence. Generally, automatic annotation is successful: the figure is detected correctly, oriented correctly (this is recovered from the torso orientation estimates T_i), and the description of the figure's activities is largely correct.

Fig. 4 compares three versions of our system on a 288 frame sequence of a figure walking back and forth. Comparing the annotations on the left (where configurations have been inferred without temporal dependency) with the center (with temporal dependency), we see temporal dependency in inferred configurations is important, because otherwise the figure can change direction quickly, particularly during lateral views of the stance phase of a walk (section 4). Comparing the center annotations with those on the right (smoothed with our HMM) shows that annotation smoothing makes it possible to remove spurious jump, reach, and stand labels — the label dynamics are wrong.

We show smoothed annotations for three figures from one sequence passing a ball back and forth in Fig. 5; the sequence contains a lot of fast movement. Each actor is correctly detected, and the system produces largely correct descriptions of the actor's orientation and actions.
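The smoothed annotations shown in Figs. 3-5 come from the codeword HMM of Section 5, which is simple enough to sketch. The version below assumes the fully observed motion-capture frames are given as parallel arrays of codeword and cluster indices; the matrices are learned by counting (the small additive-smoothing constant and the uniform prior over the initial codeword are our assumptions), and decoding reuses a Viterbi recursion over the l codewords.

import numpy as np

def learn_codeword_hmm(codewords, clusters, l=32, k=300, eps=1e-3):
    """Multinomial estimation of the l x l codeword-transition matrix and the
    l x k codeword-to-cluster observation matrix from the fully observed
    motion-capture frames. `eps` is a small smoothing constant (assumed)."""
    A = np.full((l, l), eps)
    B = np.full((l, k), eps)
    for w, c in zip(codewords, clusters):
        B[w, c] += 1
    for w0, w1 in zip(codewords[:-1], codewords[1:]):
        A[w0, w1] += 1
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return A, B

def smooth_annotations(map_clusters, A, B):
    """Viterbi-decode the most likely codeword sequence given the MAP cluster
    (medoid) indices reported by the motion-synthesis stage; a uniform prior
    over the initial codeword is assumed."""
    logA, logB = np.log(A), np.log(B)
    N, l = len(map_clusters), A.shape[0]
    score = logB[:, map_clusters[0]].copy()
    back = np.zeros((N, l), dtype=int)
    for i in range(1, N):
        total = score[:, None] + logA         # rows: previous codeword, cols: next
        back[i] = np.argmax(total, axis=0)
        score = total[back[i], np.arange(l)] + logB[:, map_clusters[i]]
    path = np.empty(N, dtype=int)
    path[-1] = int(np.argmax(score))
    for i in range(N - 1, 0, -1):
        path[i - 1] = back[i, path[i]]
    return path                               # expand each codeword back to its bit vector

Residual decoding errors of this smoothing step show up as the confusions discussed next.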
The inference procedure interprets a run as a combination of run and walk. Quite often, the walk annotation will fire as the figure slows down to turn from face right to face left or vice versa. When the figures use their arms to catch or throw, we see increased activity for the similar annotations of catch, wave, and reach.

When a novel motion is encountered, we want the system to either respond by (1) recognizing it cannot annotate this sequence, or (2) annotating it with the best match possible. We can implement (2) by adjusting the parameters for our smoothing HMM so that the 'null' codeword (all annotation bits being off) is unlikely. In Fig. 3, system (1) responds to a jumping jack sequence (star jump, in some circles) with a combination of walking and jumping while waving. In system (2), we see an additional standing annotation for when the figure is near the closed stance.

References

[1] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding: CVIU, 73(3):428-440, 1999.

[2] O. Arikan and D. Forsyth. Interactive motion generation from examples. In Proc. ACM SIGGRAPH, 2002.

[3] O. Arikan, D. Forsyth, and J. O'Brien. Motion synthesis from annotations. In Proc. ACM SIGGRAPH, 2003.

[4] A. Bobick. Movement, activity, and action: The role of knowledge in the perception of motion. Philosophical Transactions of the Royal Society of London, B-352:1257-1265, 1997.

[5] A. F. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE T. Pattern Analysis and Machine Intelligence, 23(3):257-267, 2001.

[6] L. W. Campbell and A. F. Bobick. Recognition of human body motion using phase space constraints. In ICCV, pages 624-630, 1995.

[7] C. C. Chang and C. J. Lin. Libsvm: Introduction and benchmarks. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2000.

[8] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. In Proc. CVPR, 2000.

[9] D. M. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding: CVIU, 73(1):82-98, 1999.

[10] J. K. Hodgins and N. S. Pollard. Adapting simulated behaviors for new characters. In SIGGRAPH 97, 1997.

[11] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.

[12] M. Leventon and W. Freeman. Bayesian estimation of 3D human motion from an image sequence. Technical Report TR-98-06, MERL, 1998.

[13] D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. Technical Report UCB//CSD-03-1262, UC Berkeley, CA, 2003.

[14] D. Ramanan and D. A. Forsyth. Finding and tracking people from the bottom up. In Proc. CVPR, 2003.