{"title": "Multiple Instance Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": "We propose a robust filtering approach based on semi-supervised and multiple instance learning (MIL). We assume that the posterior density would be unimodal if not for the effect of outliers that we do not wish to explicitly model. Therefore, we seek for a point estimate at the outset, rather than a generic approximation of the entire posterior. Our approach can be thought of as a combination of standard finite-dimensional filtering (Extended Kalman Filter, or Unscented Filter) with multiple instance learning, whereby the initial condition comes with a putative set of inlier measurements. We show how both the state (regression) and the inlier set (classification) can be estimated iteratively and causally by processing only the current measurement. We illustrate our approach on visual tracking problems whereby the object of interest (target) moves and evolves as a result of occlusions and deformations, and partial knowledge of the target is given in the form of a bounding box (training set).", "full_text": "Multiple Instance Filtering\n\nKamil Wnuk\n\nStefano Soatto\n\nUniversity of California, Los Angeles\n\n{kwnuk,soatto}@cs.ucla.edu\n\nAbstract\n\nWe propose a robust \ufb01ltering approach based on semi-supervised and mul-\ntiple instance learning (MIL). We assume that the posterior density would\nbe unimodal if not for the e\ufb00ect of outliers that we do not wish to explic-\nitly model. Therefore, we seek for a point estimate at the outset, rather\nthan a generic approximation of the entire posterior. Our approach can\nbe thought of as a combination of standard \ufb01nite-dimensional \ufb01ltering (Ex-\ntended Kalman Filter, or Unscented Filter) with multiple instance learning,\nwhereby the initial condition comes with a putative set of inlier measure-\nments. 
We show how both the state (regression) and the inlier set (classi-\n\ufb01cation) can be estimated iteratively and causally by processing only the\ncurrent measurement. We illustrate our approach on visual tracking prob-\nlems whereby the object of interest (target) moves and evolves as a result\nof occlusions and deformations, and partial knowledge of the target is given\nin the form of a bounding box (training set).\n\n1 Introduction\n\nAlgorithms for \ufb01ltering and prediction have a venerable history studded by quantum leaps by\nWiener, Kolmogorov, Mortensen, Zakai, Duncan among others. Many attempts to expand\n\ufb01nite-dimensional optimal \ufb01ltering beyond the linear-Gaussian case failed,1 which explains\nin part the resurgence of general-purpose approximation methods for the \ufb01ltering equation,\nsuch as weak-approximations (particle \ufb01lters [6, 16]) as well as parametric ones (e.g., sum-\nof-Gaussians or interactive multiple models [5]). Unfortunately, in many applications of\ninterest, from visual tracking to robotic navigation, the posterior is not unimodal. This has\nmotivated practitioners to resort to general-purpose approximations of the entire posterior,\nmostly using particle \ufb01ltering. However, in many applications one has reason to believe that\nthe posterior would be unimodal if not for the e\ufb00ect of outlier measurements, and therefore\nthe interest is in a point estimate, for instance the mode, mean or median, rather than in the\nentire posterior. So, we tackle the problem of \ufb01ltering, where the data is partitioned into\ntwo unknown subsets (inliers and outliers). Our goal is to devise \ufb01nite-dimensional \ufb01ltering\nschemes that will approximate the dominant mode of the posterior distribution, without\nexplicitly modeling the outliers. 
There is a significant body of related work, summarized below.\n\n1.1 Prior related work\n\nOur goal is naturally framed in the classical robust statistical inference setting, whereby classification (inlier/outlier) is solved along with regression (filtering). We assume that an initial condition is available, both for the regressor (state) and for the inlier distribution. The latter can be thought of as training data in a semi-supervised setting. Robust filtering has been approached from many perspectives: using a robust norm (typically H\u221e or \u21131) for the prediction residual yields worst-case disturbance rejection [14, 9]; rejection sampling schemes in the spirit of the M-estimator [11] \u201crobustify\u201d classical filters and their extensions. These approaches work with few outliers, say 10\u201320%, but fail in vision applications where one typically has 90% or more. Our approach relates to recent work in detection-based tracking [3, 10] that uses semi-supervised learning [4, 18, 13], as well as multiple-instance learning [2] and latent-SVM models [8, 20].\n\n1 Also due to the non-existence of an invariant family of distributions for large classes of Fokker-Planck operators.\n\nIn [3] an ensemble of pixel-level weak classifiers is combined on-line via boosting; this is efficient but suffers from drift; [10] improves stability by using a static model trained on the first frame as a prior for labeling new training samples used to update an online classifier. MILTrack [4] addressed the problem of selecting training data for model update so as to maintain maximum discriminative power. This is related to our approach, except that we have an explicit dynamical model, rather than a scanning window for detection. Also, our discrimination criterion operates on a collection of parts/regions rather than a single template. This allows more robustness to deformations and occlusions. 
We adopt an\nincremental SVM with a fast approximation of a nonlinear kernel [21] rather than online\nboosting. Our part based representation and explicit dynamics allow us to better handle\nscale and shape changes without the need for a multi-scale image search [4, 13]. PROST [18]\nproposed a cascade of optical \ufb02ow, online random forest, and template matching. The P-N\ntracker [13] combined a median \ufb02ow tracker with an online random forest. New training\nsamples were collected when detections violated structural constraints based on estimated\nobject position. In an e\ufb00ort to control drift, new training data was not incorporated into\nthe model until the tracked object returned to a previously con\ufb01rmed appearance with high\ncon\ufb01dence. This meant that if object appearance never returned to the \u201ckey frames,\u201d the\nonline model would never be updated. In the aforementioned works objects are represented\nas a bounding box. Several recent approaches have also used segmentation to improve the\nreliability of tracking: [17] did not leverage temporal information beyond adjacent frames,\n[22] required several annotated input frames with detailed segmentations, and [7] relied on\ntrackable points on both sides of the object boundary. In all methods above there was no\nexplicit temporal modeling beyond adjacent frames; therefore the schemes had poor pre-\ndictive capabilities. Other approaches have used explicit temporal models together with\nsparsity constraints to model appearance changes [15].\n\nWe propose a semi-supervised approach to \ufb01ltering, with an explicit temporal model, that\nassumes imperfect labeling, whereby portions of the image inside the bounding box are\n\u201ctrue positives\u201d and others are outliers. This enables us to handle appearance changes, for\ninstance due to partial occlusions or changes of vantage point.\n\n1.2 Formalization\nWe denote with x(t) \u2208 Rn the state of the model at time t \u2208 Z+. 
It describes a discrete-time trajectory in a finite-dimensional (vector) space. This can be thought of as a realization of a stochastic process that evolves via some kind of ordinary difference equation x(t + 1) = f(x(t)) + \u03bd(t), where \u03bd(t) IID\u223c p\u03bd is a temporally independent and identically distributed process. We will assume that, possibly after whitening, the components of \u03bd(t) are independent.\n\nWe denote the set of measurements at time t with $y(t) = \\{y_i(t)\\}_{i=1}^{m(t)}$, $y_i(t) \\in \\mathbb{R}^k$. We assume each can be represented by some fixed dimensionality descriptor, \u03c6 : Rk \u2192 Rl; y \u21a6 \u03c6(y). In classical filtering, the measurements are a known function of the state, y(t) = h(x(t)) + n(t), up to the measurement noise, n(t), that is a realization of a stochastic process that is often assumed to be temporally independent and identically distributed, and also independent of \u03bd(t). In our case, however, the components of the measurement process y1(t), . . . , ym(t)(t) are divided into two groups: those that behave like standard measurements in a filtering process, and those that do not.\n\nThis distinction is made by an indicator variable \u03c7(t) \u2208 {\u22121, 1}m(t) of the same dimensionality as the number of measurements, whose values are unknown, and can change over time. For brevity of notation we denote the two sets of indexes as \u03c7(t)+ = {i | \u03c7i(t) = 1} and \u03c7(t)\u2212 = {i | \u03c7i(t) = \u22121}. For the first set we have that {yi(t)}i\u2208\u03c7(t)+ = h(x(t), t) + n(t), just like in classical filtering, except that the measurement model h(\u00b7, t) is time-varying in a way that includes singular perturbations, since the number of measurements changes over time, so the function h : Rn \u00d7 R \u2192 Rm(t); (x, t) \u21a6 h(x, t) changes dimension over time. 
For the second group, unlike particle filtering, we do not care to model their states, and instead just discount them as outliers. The measurements are thus samples from a stochastic process that includes two independent sources of uncertainty: the measurement noise, n(t), and the selection process \u03c7(t).\n\nOur goal is that of determining a point-estimate of the state x(t) given measurements up to time t. This will be some statistic (the mean, median, mode, etc.) of the conditional density $p(x(t) \\mid \\{y(k)\\}_{k=1}^{t})$, where the process \u03c7(t) has to be marginalized.\n\nIn order to design a filter, we first consider the full forward model of how the various samples of the inlier measurements are generated. To this end, we assume that the inlier set is separable from the outlier set by a hyperplane in some feature space, represented by the normal vector w(t) \u2208 Rl. So, given the assignment of inliers and outliers \u03c7(t), we have that the new maximal-margin boundary can be obtained from w(t \u2212 1) by several iterations of a stochastic subgradient descent procedure [19], which for brevity we denote as w(t) = stochSubgradIters(w(t\u22121), y(t), \u03c7(t)) and describe in Sec. 2 and Sec. 2.2. Conversely, if we are given the hyperplane w(t), and state x(t), the measurements can be classified via \u03c7(t) = argmin\u03c7 E(y(t), w(t), x(t), \u03c7). The energy function, E(y(t), w(t), x(t), \u03c7) depends on how one chooses to model the object and what side information is applied to constrain the selection of training data. In the implementation details we give examples of how appearance continuity can be used as a constraint in this step. 
Further, motion similarity and occlusion boundaries could also be used.\n\nFinally, the forward (data-formation) model for a sample (realization) of the measurement process is given as follows: At time t = 0, we will assume that we have available an initial distribution p(x0) together with an initial assignment of inliers and outliers \u03c70, so x(0) \u223c p(x0); \u03c7(0) = \u03c70. Given \u03c7(0), we bootstrap our classifier by minimizing a standard support vector machine cost function:\n\n$$w(1) = \\arg\\min_w \\left( \\frac{\\lambda}{2} \\|w\\|^2 + \\frac{1}{m(0)} \\sum_{i=1}^{m(0)} \\max\\left(0, 1 - \\chi_i(0) \\langle w, \\phi(y_i(0)) \\rangle\\right) \\right),$$\n\nwhere \u03bb \u2208 R is the tradeoff between the importance of margin size versus loss. At all subsequent times t, each realization evolves according to:\n\n$$\\begin{cases} x(t+1) = f(x(t)) + \\nu(t), \\\\\\\\ w(t+1) = \\mathrm{stochSubgradIters}(w(t), y(t), \\chi(t)), \\\\\\\\ \\chi(t) = \\arg\\min_{\\chi} E(y(t), w(t), x(t), \\chi), \\\\\\\\ \\{y_i(t)\\}_{i \\in \\chi(t)^+} = h(x(t), t) + n(t). \\end{cases} \\qquad (1)$$\n\nwhere the first two equations can be thought of as the \u201cmodel equations\u201d and the last two as the \u201cmeasurement equations.\u201d The presence of \u03c70 makes this a semi-supervised learning problem, where \u03c70 is the \u201ctraining set\u201d for the process \u03c7(t). Note that it is possible for the model above to proceed in open-loop, when no inliers are present.\n\nThe model (1) can easily be extended to the case when the measurement equation is in implicit form, h(x(t), {yi(t)}i\u2208\u03c7(t)+, t) = n(t), since all that matters is the innovation process e(t) \u2250 h({yi(t)}i\u2208\u03c7(t)+, \u02c6x(t), t). 
Additional extensions can be entertained where the dynamics f depends on the classifier w, so that x(t + 1) = f(x(t), w(t)) + \u03bd(t), and similarly for the measurement equation h(x(t), w(t), t), although we will not consider them here.\n\n1.3 Application example: Visual tracking with shape and appearance changes\n\nObjects of interest (e.g. humans, cars) move in ways that result in a deformation of their projection onto the image plane, even when the object is rigid. Further changes of appearance occur due to motion relative to the light source and partial occlusions. Because of the ambiguities in shape and appearance, one can fix one factor and model the other. For instance, one can fix a bounding box (shape) and model change of appearance inside, including outliers (due to occlusion) and inliers (newly visible portions of the object). Alternatively, one can enforce constancy of the reflectance function, but then shape changes as well as illumination must be modeled explicitly, which is complex [12].\n\nOur approach tracks the motion of a bounding box, enclosing the data inliers. Call c(t) \u2208 R2 the center of this bounding box, vc(t) \u2208 R2 the velocity of the center, d(t) \u2208 R2 the length of the sides of the bounding box, and vd(t) \u2208 R2 its rate of change. Thus, we have x(t) = [c(t), vc(t), d(t), vd(t)]T . As before \u03c7(t) indicates a binary labeling of the measurement components, where \u03c7(t)+ is the set of samples that correspond to the object of interest. We have tested different versions of our framework where the components are superpixels as well as trajectories of feature points. 
For reasons of space limitation, below we describe the case of superpixels, and report results for trajectories as supplementary material.\n\nConsider a time-varying image I(t) : D \u2282 R2 \u2192 R+; (u, v) \u21a6 I(u, v, t): superpixels {Si} are just a partition of the domain $D = \\cup_{i=1}^{r} S_i$ with Si \u2229 Sj = \u03b4ij; \u03c7(t) becomes a binary labeling of the superpixels, with \u03c7(t)+ collecting the indices of elements on the object of interest, and \u03c7(t)\u2212 on the background.\n\nThe measurement equation is obtained as the centroid and diameter of the restriction of the bounding box to the domain of the inlier superpixels: If y(t) = I(t) \u2208 RN\u00d7M is an image, then h1({I(u, v, t)}(u,v)\u2208Si) \u2208 R2 is the centroid of the superpixels {Si}i\u2208\u03c7(t)+ computed from I(t), and h2({I(u, v, t)}(u,v)\u2208Si) \u2208 R2 is the diameter of the same region. This is in the form (1), with h constant (the time dependency is only through y(t) and \u03c7(t)). The resulting model is:\n\n$$\\begin{cases} x(t+1) = F x(t) + \\nu(t) \\\\\\\\ w(t+1) = \\mathrm{stochSubgradIters}(w(t), y(t), \\chi(t)) \\\\\\\\ \\chi(t) = \\arg\\min_{\\chi} E(y(t), w(t), x(t), \\chi) \\\\\\\\ h(\\{y_i(t)\\}_{i \\in \\chi(t)^+}) = C x(t) + n(t) \\end{cases} \\qquad (2)$$\n\nwhere F \u2208 R8\u00d78 is block-diagonal with each 4 \u00d7 4 block given by $\\begin{bmatrix} I & I \\\\\\\\ 0 & I \\end{bmatrix}$, C \u2208 R4\u00d78, $C = \\begin{bmatrix} I & 0 & 0 & 0 \\\\\\\\ 0 & 0 & I & 0 \\end{bmatrix}$, and I is the 2 \u00d7 2 identity matrix. Similarly, \u03bd(t) IID\u223c N (0, Q), Q \u2208 R8\u00d78 and n(t) IID\u223c N (0, R), R \u2208 R4\u00d74.\n\n2 Algorithm development\n\nWe focus our discussion in this section on the development of the discriminative appearance model at the heart of the inlier/outlier classification, w(t). 
For simplicity, pretend for now that each frame contains m observations. We assume an object is identified with a subset of the observations (inliers); at time t, we have {yi(t)}i\u2208\u03c7(t)+. Also pretend that observations from all frames, $Y = \\{y(t)\\}_{t=1}^{N_f}$, were available simultaneously; Nf is the number of frames in the video sequence. If all frames were labeled (\u03c7(t) known \u2200 t), a maximum margin classifier \u02c6w could be obtained by minimizing the objective (3) over all samples in all frames:\n\n$$\\hat{w} = \\arg\\min_w \\left( \\frac{\\lambda}{2} \\|w\\|^2 + \\frac{1}{m N_f} \\sum_{t=1}^{N_f} \\sum_{i=1}^{m} \\ell(w, \\phi(y_i(t)), \\chi_i(t)) \\right) \\qquad (3)$$\n\nwhere \u03bb \u2208 R, and \u2113(w, \u03c6(yi(t)), \u03c7i(t)) is a loss that ensures data fit. We use the hinge loss \u2113(w, \u03c6(yi(t)), \u03c7i(t)) = max(0, 1 \u2212 \u03c7i(t)\u27e8w, \u03c6(yi(t))\u27e9) in which slack is implicit, so we can use an efficient sequential optimization in the primal form.\n\nIn reality an exact label assignment at every frame is not available, so we must infer the latent labeling \u03c7 simultaneously while learning the hyperplane w. Continuing our hypothetical batch processing scenario, pretend we have estimates of some state of the object throughout time, $\\hat{X} = \\{\\hat{x}(t)\\}_{t=1}^{N_f}$. This allows us to identify a reduced subset of candidate inliers (in MIL terminology a positive bag), within which we assume all inliers are contained. The specification of a positive bag helps reduce the search space, since we can assume all samples outside of a positive bag are negative. 
This changes the SVM formulation to a mixed integer program similar to the mi-SVM [2], except that [2] assumed a positive/negative bag partition was given, whereas we use the estimated state and add a term to the decision boundary cost function to express the dependence between the labeling, \u03c7(t), and state estimate, \u02c6x, at each time:\n\n$$\\hat{w}, \\hat{\\chi} = \\arg\\min_{w, \\chi} \\left( \\frac{\\lambda}{2} \\|w\\|^2 + \\frac{1}{m N_f} \\sum_{t=1}^{N_f} \\left( \\sum_{i=1}^{m} \\max\\left(0, 1 - \\chi_i(t) \\langle w, \\phi(y_i(t)) \\rangle\\right) + E(y(t), \\chi(t), \\hat{x}(t)) \\right) \\right). \\qquad (4)$$\n\nHere E(y(t), \u03c7(t), \u02c6x(t)) represents a general mechanism to enforce constraints on label assignment on a per-frame basis within a temporal sequence.2 A standard optimization procedure alternates between updating the decision boundary w, subject to an estimated labeling \u02c6\u03c7, followed by relabeling the original data to satisfy the positive bag constraints generated from the state estimates, \u02c6x, while keeping w fixed:\n\n$$\\begin{cases} \\hat{w} = \\arg\\min_w \\left( \\frac{\\lambda}{2} \\|w\\|^2 + \\frac{1}{m N_f} \\sum_{t=1}^{N_f} \\sum_{i=1}^{m} \\max\\left(0, 1 - \\hat{\\chi}_i(t) \\langle w, \\phi(y_i(t)) \\rangle\\right) \\right), \\\\\\\\ \\hat{\\chi} = \\arg\\min_{\\chi} \\frac{1}{m N_f} \\sum_{t=1}^{N_f} \\left( \\sum_{i=1}^{m} \\max\\left(0, 1 - \\chi_i(t) \\langle \\hat{w}, \\phi(y_i(t)) \\rangle\\right) + E(y(t), \\chi(t), \\hat{x}(t)) \\right). \\end{cases} \\qquad (5)$$\n\nIn practice, annotation is available only in the first frame, and the data must be processed causally and sequentially. Recently, [19] proposed an efficient incremental scheme, PEGASOS, to solve the hinge loss objective in the primal form. This enables straightforward incremental training of w as new data becomes available. The algorithm operates on a training set consisting of tuples of labeled descriptors: $T = \\{(\\phi(y_i), \\chi_i)\\}_{i=1}^{m}$. 
In a nutshell, at each PEGASOS iteration we select a subset of training samples from the current training set Aj \u2286 T, and update w according to wj+1 = wj \u2212 \u03b7j\u2207j. The subgradient of the hinge loss is given by $\\nabla_j = \\lambda w_j - \\frac{1}{|A_j|} \\sum_{i \\in A_j} \\chi_i \\phi(y_i)$. To finalize the update and accelerate convergence wj+1 is projected onto the set $\\{w : \\|w\\| \\leq 1/\\sqrt{\\lambda}\\}$, which [19] show is the space containing the optimal solution.\n\nThe second objective of Eq. (5) seeks a solution to the binary integer program of inlier selection given \u02c6w and \u02c6x. Instead of tackling this NP-hard problem, we re-interpret it as a constraint enforcement step based on additional cues within a search area specified by the current state estimate. One example constraint for a superpixel based object representation is to re-interpret the given objective as a graph cut problem, with pairwise terms enforcing appearance consistency. See supplementary material for details, as well as for experiments with other choices of constraints for tracks, rather than superpixels.\n\n2.1 Initialization\n\nAt t = 0 we are given initial observations y(0) and a bounding box indicating the object of interest {c(0) \u00b1 d(0)}. We initialize \u03c7(0) with positive indices corresponding to superpixels that have a majority of their area |yi(0)| within the bounding box:\n\n$$\\chi_i(0) = \\begin{cases} 1 & \\text{if } \\frac{|\\{c(0) \\pm d(0)\\} \\cap y_i(0)|}{|y_i(0)|} > \\varepsilon_y, \\\\\\\\ -1 & \\text{otherwise.} \\end{cases} \\qquad (6)$$\n\nThe area threshold is \u03b5y = 0.7 throughout all experiments. This represents a bootstrap training set, T1, from which we learn an initial classifier w(1) for distinguishing object appearance. Each element of the training set is a triplet (\u03c6(yi(t)), \u03c7i(t), \u03c4i = t), where the last element is the time at which the feature is added to the training set. 
We start by selecting all positive samples and a set number of negatives, nf, sampled randomly from \u03c7(0)\u2212, giving\n\n$$T_1 = \\{(\\phi(y_i(0)), \\chi_i(0), 0)\\}_{\\forall i \\in \\chi(0)^+} \\cup \\{(\\phi(y_j(0)), \\chi_j(0), 0) \\mid j \\in \\chi(0)^{-}_{\\mathrm{rand}} \\subseteq \\chi(0)^{-}, |\\chi(0)^{-}_{\\mathrm{rand}}| = n_f\\}.$$\n\n2 It represents the side information necessary to avoid zero information gain in the semi-supervised inference procedure.\n\n2.2 Prediction Step\n\nAt time t, given the current estimate of the object state and classification \u03c7(t), we add all positive samples and difficult negative samples lying outside of the estimated bounding box to the new training set Tt+1|t. We then propagate the object state with the model of motion dynamics and finally update the decision boundary with the newly updated training set.\n\n$$\\begin{cases} \\hat{x}(t+1|t) = F \\hat{x}(t|t) \\\\\\\\ P(t+1|t) = F P(t|t) F^T + Q \\\\\\\\ T_{t+1} = T_{t+1,\\mathrm{old}} \\cup T_{t+1,\\mathrm{new}} \\\\\\\\ T_{t+1,\\mathrm{old}} = \\{(\\phi(y_i), \\chi_i, \\tau_i) \\mid \\chi_i \\langle \\phi(y_i), w(t) \\rangle < 1,\\ t - \\tau_i \\leq \\tau_{\\max}\\} \\\\\\\\ T_{t+1,\\mathrm{new}} = \\{(\\phi(y_i(t)), \\chi_i(t), t) \\mid \\chi_i(t) = 1\\} \\cup \\{(\\phi(y_i(t)), -1, t) \\mid \\frac{|(D \\setminus \\{\\hat{c}(t|t) \\pm \\hat{d}(t|t)\\}) \\cap y_i(t)|}{|y_i(t)|} \\geq 1 - \\varepsilon_y,\\ \\langle \\phi(y_i(t)), w(t) \\rangle > -1\\} \\\\\\\\ w(t+1) \\leftarrow \\text{for } j = n_T, \\dots, N \\text{ (update starting with } w_{n_T} = w(t)\\text{)} \\\\\\\\ \\quad \\text{choose } A_j \\subseteq T_{t+1} \\\\\\\\ \\quad \\eta_j = \\frac{1}{\\lambda j} \\\\\\\\ \\quad w_{j+1} = (1 - \\eta_j \\lambda) w_j + \\frac{\\eta_j}{|A_j|} \\sum_{i \\in A_j} \\chi_i(t) \\phi(y_i(t)) \\\\\\\\ \\quad w_{j+1} = \\min\\left\\{1, \\frac{1/\\sqrt{\\lambda}}{\\|w_{j+1}\\|}\\right\\} w_{j+1} \\\\\\\\ \\text{end} \\end{cases} \\qquad (7)$$\n\nIt is typically not necessary to update w at every step, so training data can be 
collected\nover several frames during which w(t + 1) = w(t) and the update above can be invoked\neither at some regular interval, on demand, or upon some form of model validation as\nin [13]. The parameter \u03c4max determines memory of the classi\ufb01er update procedure for\ndi\ufb03cult examples.\nIf \u03c4max = 0, no memory is used and training data for model update\nconsists only of observations from the current image. Such a memory of recent training\nsamples is analogous to the training cache used in [8] for training the latentSVM model.\nDuring each classi\ufb01er update we perform N \u2212 nT iterations of the stochastic subgradient\ndescent algorithm, starting from the current best estimate of the separating hyperplane\nwnT = w(t). The overall number of iterations N is set as N = 20/\u03bb, where \u03bb is a function\nof the bootstrap training set size, \u03bb = 1/(10|T1|). The number in the denominator is used\nas a parameter to set the relative importance of the margin size and the loss, but we \ufb01x\nit at 10 for our experiments. The number of iterations at a new time is then decided by\nnT = max(1\u2212|Tt|/N, 0.75) in order to limit how much the hyperplane can change in a single\nupdate. 
These parameters can also be viewed as tuning the learning rates and forgetting factors of the classifier.\n\n2.3 Update Step\n\nThe innovation is in implicit form with h({yi(t + 1)}i\u2208\u03c7(t+1)+) \u2208 R4 giving a tight bounding box around the selected foreground regions in the same form as they appear in the state. In the update equations r specifies the size of the search region around the predicted state within which we consider observations as candidates for foreground; \u03be specifies the indices of candidate observations (positive bag).\n\n$$\\begin{cases} r = \\lambda_r \\left( \\begin{bmatrix} I & 0 \\end{bmatrix} \\mathrm{diag}(C P(t+1|t) C^T) + \\begin{bmatrix} 0 & I \\end{bmatrix} \\mathrm{diag}(C P(t+1|t) C^T) \\right), \\\\\\\\ \\xi = \\left\\{ i \\mid \\frac{|\\{c(t+1|t) \\pm (d(t+1|t) + r)\\} \\cap y_i(t+1)|}{|y_i(t+1)|} > \\varepsilon_y \\right\\}, \\\\\\\\ \\chi(t+1) = \\arg\\min_{\\chi \\in \\{-1,1\\}^m} E(w(t+1), \\{y_i(t+1)\\}_{i \\in \\xi}, \\hat{x}(t+1|t), \\chi) \\\\\\\\ e(t+1) = h(\\{y_i(t+1)\\}_{i \\in \\chi(t+1)^+}) - C \\hat{x}(t+1|t) \\\\\\\\ L = P(t+1|t) C^T (C P(t+1|t) C^T + R)^{-1} \\\\\\\\ \\hat{x}(t+1|t+1) = \\hat{x}(t+1|t) + L e(t+1) \\\\\\\\ P(t+1|t+1) = (I - LC) P(t+1|t) (I - LC)^T + L R L^T. \\end{cases} \\qquad (8)$$\n\nAbove \u03bbr \u2208 R is a factor (we fix it at 3) for scaling the region size based on filter covariance.\n\nFigure 1: Ski sequence: Left panel shows frame number, search area (black rectangle), filter prediction (blue), observation (red), and updated filter estimate (green). The center panels overlay the SVM scores for each region (solid blue = \u22121, solid red = 1). Right panels show the regions selected as inliers. This challenging sequence includes viewpoint and scale changes, deformation, changing background. 
The algorithm performs well and successfully recovers from a missed detection (from frame 349 to 352 shown above).\n\nFigure 2: P-N tracker [13] (above) and MILTrack [4] (below) initialized with the same bounding box as our approach. Original implementations by the respective authors were used for this comparison. The P-N tracker fails because of the absence of stable low-level tracks on the target and quickly locks onto a patch of trees in the background. MILTrack survives longer but does not adapt scale quickly enough, eventually drifting to become a detector of the tree line.\n\n3 Experiments\n\nTo compare with [18, 4, 13], we first evaluate our discriminative model without maintaining any training data history (\u03c4max = 0) and updating w every 6 frames, with training data collected between incremental updates. Even with \u03c4max = 0 we can track highly deforming objects (a skier) with significant scale changes through most of the 1496 frames (Fig. 1). We also recover from errors due to the implicit memory in the decision boundary from incremental updating. For comparison, [4, 13] quickly drift and fail to recover (Fig. 2).\n\nFor a quantitative comparison we test our full algorithm against the state of the art on the PROST dataset [18] consisting of 4 videos with fast motion, occlusions, scale changes, translucency, and small background motions. In all experiments \u03c4max = 25, and all other parameters were fixed as described earlier and in supplementary material. Two evaluation metrics are reported: the mean center location error in pixels [4], and percentage of correctly tracked frames as computed by the bounding box overlap criterion $\\frac{\\mathrm{area}(ROI_D \\cap ROI_{GT})}{\\mathrm{area}(ROI_D \\cup ROI_{GT})} > 0.5$, where ROID is the detected region and ROIGT is the ground truth region. The ground truth for the PROST dataset is reported using a constant sized bounding box. Table 1 compares to [18, 4, 1, 13].\n\nFigure 3: Convergence of the classifier: Samples from frames 113, 125, 733, and 1435 of the \u201cliquor\u201d sequence. 
The leftmost image shows the probabilities returned by the initial classifier trained using only the first frame, the second image shows the foreground probabilities returned from the current classifier, the third image shows the foreground selection made by the graph-cut step, and the final image shows the smoothed score used to select bounding box location.\n\nIn the liquor sequence our method correctly shrinks the bounding box to the label, since the rest of the bottle is not discriminative. Unfortunately, this is penalized in the Pascal score since the area ratio drops below 0.5 of the initial bounding box despite perfect tracking. This causes the score to drop to 18.9%. If we modify the criterion to count as valid a detection where > 99% of the detection area lies within the annotated ground truth region, the score becomes 75.6%. If we allow for > 90% of the detected area to lie within the ground truth box, the final pascal result for the liquor sequence becomes 79.1%. See Figure 3. The same phenomenon occurs in the box sequence, where our approach adapts to tracking the label at the bottom of the box. Note that this additional detection criterion has no effect on any other scores. 
Additional results, including failure modes as well as successful tracking where other approaches fail, are reported in the supplementary material, both for the case of superpixels and tracks.\n\n               Overall  board             box               lemming           liquor\nMethod         pascal   pascal  distance  pascal  distance  pascal  distance  pascal  distance\nours           74.7     92.1    13.7      42.9*   63.7      88.1    19.4      75.6*   42.5*\nP-N [13]       37.15    12.9    139.5     36.9    99.3*     34.3    26.4*     64.5    17.4*\nPROST [18]     80.4     75.0    39.0      90.6    13.0      70.5    25.1      85.4    21.5\nMILTrack [4]   49.2     67.9    51.2      24.5    104.6     83.6    14.9      20.6    165.1\nFragTrack [1]  66.0     67.9    90.1      61.4    57.4      54.9    82.8      79.9    30.7\n\nTable 1: Comparison with recent methods on the PROST dataset. Best scores for each sequence and metric are shown in bold. Our method and the P-N tracker [13] do not always detect the object. Ground truthed frames in which no location was reported by the method of [13] were not counted into the final distance score. The method of [13] missed 2 detections on the box sequence, 1 detection on the lemming sequence, and 80 on the liquor sequence. When our approach failed to detect the object, we used the predicted bounding box from the state of the filter as our reported result.\n\n4 Discussion\n\nWe have proposed an approach to robust filtering embedding a multiple instance learning SVM within a filtering framework, and iteratively performing regression (filtering) and classification (inlier selection) in hope of reaching an approximate estimate of the dominant mode of the posterior for the case where other modes are due to outlier processes in the measurements. 
We emphasize that our approach comes with no provable properties or\nguarantees, other than for the trivial case when the dynamics are linear, the inlier-outlier\nsets are linearly separable, the noises are Gaussian, zero-mean, IID white and independent\nwith known covariance, and the initial inlier set is known to include all inliers but is\nnot necessarily pure. In this case, the proposed method converges to the conditional mean\nof the posterior p(x(t) | {y(k)}_{k=1}^{t}). However, we have provided empirical validation of our\napproach on challenging visual tracking problems, where it exceeds the state of the art, and\nillustrated some of its failure modes.\n\nAcknowledgment: Research supported by AFOSR FA9550-09-1-0427, ONR\nN000141110863, and DARPA FA8650-11-1-7156.\n\nReferences\n\n[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In Proc. CVPR, 2006.\n\n[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Proc. NIPS, 2003.\n\n[3] S. Avidan. Ensemble tracking. PAMI, 29:261\u2013271, 2007.\n\n[4] B. Babenko, M.-H. Yang, and S. Belongie. Visual tracking with online multiple instance learning. In Proc. CVPR, 2009.\n\n[5] Y. Bar-Shalom and X.-R. Li. Estimation and tracking: principles, techniques and software. YBS Press, 1998.\n\n[6] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. Springer Verlag, New York, 2001.\n\n[7] J. Fan, X. Shen, and Y. Wu. Closed-loop adaptation for robust tracking. In Proc. ECCV, 2010.\n\n[8] P. Felzenszwalb, D. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. In PAMI, 2010.\n\n[9] L. El Ghaoui and G. Calafiore. Robust filtering for discrete-time systems with structured uncertainty. In IEEE Transactions on Automatic Control, 2001.\n\n[10] H. Grabner, C. Leistner, and H. Bischof. 
Semi-supervised on-line boosting for robust tracking. In Proc. ECCV, 2008.\n\n[11] P.J. Huber. Robust Statistics. Wiley, New York, 1981.\n\n[12] J. Jackson, A. J. Yezzi, and S. Soatto. Dynamic shape and appearance modeling via moving and deforming layers. IJCV, 79(1):71\u201384, August 2008.\n\n[13] Z. Kalal, J. Matas, and K. Mikolajczyk. P-N learning: Bootstrapping binary classifiers by structural constraints. In Proc. CVPR, 2010.\n\n[14] H. Li and M. Fu. A linear matrix inequality approach to robust H\u221e filtering. IEEE Transactions on Signal Processing, 45(9):2338\u20132350, September 1997.\n\n[15] H. Lim, V. Morariu, O. Camps, and M. Sznaier. Dynamic appearance modeling for human tracking. In Proc. CVPR, 2006.\n\n[16] J. Liu. Monte Carlo strategies in scientific computing. Springer Verlag, 2001.\n\n[17] X. Ren and J. Malik. Tracking as repeated figure/ground segmentation. In Proc. CVPR, 2007.\n\n[18] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof. PROST: Parallel Robust Online Simple Tracking. In Proc. CVPR, 2010.\n\n[19] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. ICML, 2007.\n\n[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453\u20131484, September 2005.\n\n[21] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In Proc. CVPR, 2010.\n\n[22] Z. Yin and R. T. Collins. Shape constrained figure-ground segmentation and tracking. In Proc. CVPR, 2009.\n", "award": [], "sourceid": 274, "authors": [{"given_name": "Kamil", "family_name": "Wnuk", "institution": null}, {"given_name": "Stefano", "family_name": "Soatto", "institution": null}]}