{"title": "Fusion with Diffusion for Robust Visual Tracking", "book": "Advances in Neural Information Processing Systems", "page_first": 2978, "page_last": 2986, "abstract": "A weighted graph is used as an underlying structure of many algorithms like semi-supervised learning and spectral clustering. The edge weights are usually determined by a single similarity measure, but it is often hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means a significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t and target object candidates in frame t+1 are integrated. 
The obtained method is tested on several challenging video sequences and the experimental results show that it outperforms state-of-the-art tracking methods.", "full_text": "Fusion with Diffusion for Robust Visual Tracking\n\nYu Zhou1\u2217, Xiang Bai1, Wenyu Liu1, Longin Jan Latecki2\n\n1 Dept. of Electronics and Information Engineering, Huazhong Univ. of Science and Technology, P. R. China\n\n2 Dept. of Computer and Information Sciences, Temple Univ., Philadelphia, USA\n\n{zhouyu.hust,xiang.bai}@gmail.com,liuwy@hust.edu.cn,latecki@temple.edu\n\nAbstract\n\nA weighted graph is used as an underlying structure of many algorithms like semi-supervised\nlearning and spectral clustering. If the edge weights are determined by\na single similarity measure, then it is hard if not impossible to capture all relevant\naspects of similarity.\nIn particular, in\nthe case of visual object matching it is beneficial to integrate different similarity\nmeasures that focus on different visual representations.\nIn this paper, a novel approach to integrate multiple similarity measures is proposed.\nFirst pairs of similarity measures are combined with a diffusion process on\ntheir tensor product graph (TPG). Hence the diffused similarity of each pair of objects\nbecomes a function of joint diffusion of the two original similarities, which\nin turn depends on the neighborhood structure of the TPG. We call this process\nFusion with Diffusion (FD). However, a higher order graph like the TPG usually\nmeans a significant increase in time complexity. This is not the case in the proposed\napproach. A key feature of our approach is that the time complexity of the diffusion\non the TPG is the same as the diffusion process on each of the original\ngraphs. Moreover, it is not necessary to explicitly construct the TPG in our framework.\nFinally all diffused pairs of similarity measures are combined as a weighted\nsum. 
We demonstrate the advantages of the proposed approach on the task of\nvisual tracking, where different aspects of the appearance similarity between the\ntarget object in frame t \u2212 1 and target object candidates in frame t are integrated.\nThe obtained method is tested on several challenging video sequences and the\nexperimental results show that it outperforms state-of-the-art tracking methods.\n\n1 Introduction\n\nThe considered problem has a simple formulation: Given multiple similarities between the same\nset of n data points, each similarity can be represented as a weighted graph. The goal is to combine\nthem into a single similarity measure that best reflects the underlying data manifold. Since the set of\nnodes is the same, it is easy to combine the graphs into a single weighted multigraph, where there\nare multiple edges between the same pair of vertices representing different similarities. Then our\ntask can be stated as finding a mapping from the multigraph to a weighted simple graph whose edge\nweights best represent the similarity of the data points. Of course, this formulation is not precise,\nsince generally the data manifold is unknown, and hence it is hard to quantify the 'best'. However,\nit is possible to evaluate the quality of the combination experimentally in many applications, e.g.,\nthe tracking performance considered in this paper.\nThere are many possible solutions to the considered problem. One of the most obvious ones is a\nweighted linear combination of the similarities. However, this solution does not consider the similarity\ndependencies of different data points. The proposed approach aims to utilize the neighborhood\nstructure of the multigraph in the mapping to the weighted simple graph.\n\u2217Part of this work was done while the author was visiting Temple University\n\nGiven two different similarity measures, we first construct their Tensor Product Graph (TPG). 
Then\nwe jointly diffuse both similarities with a diffusion process on TPG. However, while the original\ngraphs representing the two measures have n nodes, their TPG has n^2 nodes, which significantly\nincreases the time complexity of the diffusion on TPG. To address this problem, we introduce an\niterative algorithm that operates on the original graphs and prove that it is equivalent to the diffusion\non TPG. We call this process Fusion with Diffusion (FD). FD is a generalization of the approach\nin [26], where only a single similarity measure is considered. While the diffusion process on TPG in\n[26] is used to enhance a single similarity measure, our approach aims at combining two different\nsimilarity measures so that they enhance and constrain each other.\nAlthough algorithmically very different, our motivation is similar to co-training style algorithms in\n[5, 23, 24] where multiple cues are fused in an iterative learning process. The proposed approach is\nalso related to the semi-supervised learning in [6, 7, 21, 28, 29]. For the online tracking task, we only\nhave the label information from the current frame, which can be regarded as the labeled data, and\nthe label information in the next frame is unavailable, which can be regarded as unlabeled data. In\nthis context, FD jointly propagates two similarities of the unlabeled data to the labeled data. The\nnew diffused similarity can then be interpreted as the label probability over the unlabeled\ndata. Hence from the point of view of visual tracking, but in the spirit of semi-supervised learning,\nour approach utilizes the unlabeled data from the next frame for improved visual similarity to the\nlabeled data representing the tracked objects.\nVisual tracking is an important issue in computer vision and has many practical applications. The\nchallenges in designing a tracking system are often caused by shape deformation, occlusion, viewpoint\nvariations, and background clutter. 
Different strategies have been proposed to obtain robust\ntracking systems. In [8, 12, 14, 16, 25, 27], a matching based strategy is utilized. A discriminative\nappearance model of the target is extracted from the current frame, then the optimal target is estimated\nbased on the distance/similarity between the appearance model and the candidates in the hypothesis\nset. Classification based strategies are introduced in [1, 2, 3, 4, 10, 11]. The tracking task is transformed\ninto a foreground and background binary classification problem in this framework. The methods in [15, 20] try to\ncombine the two frameworks. In this paper, we focus on improving the distance/similarity\nmeasure to improve the matching based tracking strategy. Our motivation is similar to [12], where\nmetric learning is proposed to improve the distance measure. However, different from [12], multiple\ncues are fused to improve the similarity in our approach. Moreover, the information from the forthcoming\nframe is also used to improve the similarity. This leads to more stable tracking performance\nthan in [12].\nMultiple cue fusion seems to be an effective way to improve the tracking performance. In [13],\nmultiple feature fusion is implemented based on sampling the state space. In [20], the tracking task\nis formulated as the combination of different trackers; three different trackers are combined into a\ncascade. Different from those methods, we combine different similarities into a single similarity\nmeasure, which makes our method more general for integrating various appearance models.\nIn summary, we propose a novel framework for integration of multiple similarity measures into a\nsingle consistent similarity measure, where the similarity of each pair of data points depends on their\nsimilarity to other data points. 
We demonstrate its superior performance on a challenging task of\ntracking by visual matching.\n\n2 Problem Formulation\n\nThe problem of matching based visual tracking boils down to the following simple formulation.\nGiven the target in frame It\u22121, which can be represented as image patch I1 enclosing the target, and\nthe set of candidate target patches in frame It, C = {In| n = 2, ..., N}, the goal is to determine\nwhich patch in C corresponds to the target in frame It\u22121. Of course, one can make this setting more\ncomplicated, e.g., by considering more frames, but we consider this simple formulation in this paper.\nThe candidate set C is determined by the motion model, which is particularly simple in our setting.\nThe size of all the image patches is fixed and the candidate set is composed of patches in frame It\ninside a search radius r, i.e. ||c(In) \u2212 c(I1)|| < r, where c is the 2-D coordinate of center position\nof the image patch.\nLet S be a similarity measure defined on the set of the image patches V = {I1} \u222a C, i.e., S is a\nfunction from V \u00d7 V into positive real numbers. Then our tracking goal can be formally stated as\n\n$$\\hat{I} = \\arg\\max_{X \\in C} S(I_1, X) \\tag{1}$$\n\nmeaning that the patch in C with most similar appearance to patch I1 is selected as the target location\nin frame t.\nSince the appearance of the target object changes, e.g., due to motion and lighting changes, a single\nsimilarity measure is often not sufficient to identify the target in the next frame. Therefore, we consider\na set of similarity measures S = {S1, . . . , SQ}, each S\u03b1 defined on V \u00d7 V for \u03b1 = 1, . . . , Q.\nFor example, in our experimental results, each image patch is represented with three histograms\nbased on three different features, HOG [9], LBP [18], and Haar-like features [4], which lead to three different\nsimilarity measures. 
In other words, each pair of patches can be compared with respect to three\ndifferent appearance features.\nWe can interpret each similarity measure S\u03b1 as the affinity matrix of a graph G\u03b1 whose vertex set\nis V, i.e., S\u03b1 is an N \u00d7 N matrix with positive entries, where N is the cardinality of V. Then we\ncan combine the graphs G\u03b1 into a single multigraph whose edge weights correspond to different\nsimilarity measures S\u03b1.\nHowever, in order to solve Eq. (1), we need a single similarity measure S. Hence we face the question\nof how to combine the measures in S into a single similarity measure. We propose a two stage approach\nto answer this question. First, we combine pairs of similarity measures S\u03b1 and S\u03b2 into a single\nmeasure $P^{*}_{\\alpha,\\beta}$, which is a matrix of size N \u00d7 N. $P^{*}_{\\alpha,\\beta}$ is defined in Section 3 and it is obtained with\nthe proposed process called fusion with diffusion.\nIn the second stage we combine all $P^{*}_{\\alpha,\\beta}$ for \u03b1, \u03b2 = 1, . . . , Q into a single similarity measure S\ndefined as a weighted matrix sum\n\n$$S = \\sum_{\\alpha,\\beta} \\omega_\\alpha \\omega_\\beta P^{*}_{\\alpha,\\beta} \\tag{2}$$\n\nwhere \u03c9\u03b1 and \u03c9\u03b2 are positive weights associated with measures S\u03b1 and S\u03b2 defined in Section 5.\nWe also observe that in contrast to many tracking by matching methods, the combined measure S is\nnot only a function of similarities between I1 and the candidate patches in C, but also of similarities\nof patches in C to each other.\n\n3 Fusion with Diffusion\n\n3.1 Single Graph on Consecutive Frames\n\nGiven a single graph G\u03b1 = (V, S\u03b1), a reversible Markov chain on V can be constructed with the\ntransition probability defined as\n\n$$P_\\alpha(i, j) = S_\\alpha(i, j) / D_i \\tag{3}$$\n\nwhere $D_i = \\sum_{j=1}^{N} S_\\alpha(i, j)$ is the degree of each vertex. 
Then the transition probability P\u03b1(i, j)\ninherits the positivity-preserving property $\\sum_{j=1}^{N} P_\\alpha(i, j) = 1$, i = 1, ..., N.\nThe graph G\u03b1 is a fully connected graph in many applications. To reduce the influence of noisy points,\ni.e., cluttered background patches in tracking, a local transition probability is used:\n\n$$(P_{k,\\alpha})(i, j) = \\begin{cases} P_\\alpha(i, j) & j \\in \\mathrm{kNN}(i) \\\\ 0 & \\text{otherwise} \\end{cases} \\tag{4}$$\n\nHence the number of non-zero elements in each row is not\nlarger than k, which implies\n$\\sum_{j=1}^{N} (P_{k,\\alpha})(i, j) < 1$. This inequality is important in our framework, since it guarantees the\nconvergence of the diffusion process on the tensor product graph presented in the next section.\n\n3.2 Tensor Product Graph of Two Similarities\n\nGiven two graphs G\u03b1 = (V, Pk,\u03b1) and G\u03b2 = (V, Pk,\u03b2) defined in Sec. 3.1, we can define their\nTensor Product Graph (TPG) as\n\n$$G_\\alpha \\otimes G_\\beta = (V \\times V, P), \\tag{5}$$\n\nwhere P = Pk,\u03b1 \u2297 Pk,\u03b2 is the Kronecker product of matrices defined as P(a, b, i, j) =\nPk,\u03b1(a, b) Pk,\u03b2(i, j). Thus, each entry of P relates four image patches. When Pk,\u03b1 and Pk,\u03b2 are\ntwo N \u00d7 N matrices, then P is an N^2 \u00d7 N^2 matrix. However, as we will see in the next subsection,\nwe actually never compute P explicitly.\n\n3.3 Diffusion Process on Tensor Product Graph\n\nWe utilize a diffusion process on TPG to combine the two similarity measures Pk,\u03b1 and Pk,\u03b2. We\nbegin with some notations. The vec operator creates a column vector from a matrix M by stacking\nthe column vectors of M below one another. More formally, $\\mathrm{vec} : \\mathbb{R}^{N \\times N} \\to \\mathbb{R}^{N^2}$ is defined as\n$\\mathrm{vec}(M)_g = (M)_{ij}$, where $i = \\lfloor (g - 1)/N \\rfloor + 1$ and $j = g \\bmod N$. The inverse operator $\\mathrm{vec}^{-1}$\nthat maps a vector into a matrix is often called the reshape operator. 
We define a diagonal N \u00d7 N\nmatrix as\n\n$$\\Delta(i, i) = \\begin{cases} 1 & i = 1 \\\\ 0 & \\text{otherwise.} \\end{cases} \\tag{6}$$\n\nOnly the entry representing the patch I1 is set to one and all other entries are set to zero in \u2206.\nWe observe that P is the adjacency matrix of TPG G\u03b1 \u2297 G\u03b2. We define a q-th iteration of the\ndiffusion process on this graph as\n\n$$\\sum_{e=0}^{q} P^{e} \\, \\mathrm{vec}(\\Delta). \\tag{7}$$\n\nAs shown in [26], this iterative process is guaranteed to converge to a nontrivial solution given by\n\n$$\\lim_{q \\to \\infty} \\sum_{e=0}^{q} P^{e} \\, \\mathrm{vec}(\\Delta) = (I - P)^{-1} \\mathrm{vec}(\\Delta), \\tag{8}$$\n\nwhere I is an identity matrix. Following [26], we define\n\n$$P^{*}_{\\alpha,\\beta} = P^{*} = \\mathrm{vec}^{-1}\\big((I - P)^{-1} \\mathrm{vec}(\\Delta)\\big) \\tag{9}$$\n\nWe observe that our solution P\u2217 is an N \u00d7 N matrix.\nWe call the diffusion process to compute P\u2217 a Fusion with Diffusion (FD) process, since diffusion\non TPG G\u03b1 \u2297 G\u03b2 is used to fuse two similarity measures S\u03b1 and S\u03b2.\nSince P is an N^2 \u00d7 N^2 matrix, FD process on TPG as defined in Eq. (7) may be computationally too\ndemanding. To compute P\u2217 efficiently, instead of diffusing on TPG directly, we show in Section 3.4\nthat FD process on TPG is equivalent to an iterative process on N \u00d7 N matrices only. Consequently,\ninstead of an O(n^6) time complexity, we obtain an O(n^3) complexity. 
Then in Section 4 we further\nreduce it to an efficient algorithm with time complexity O(n^2), which can be used in real time\ntracking algorithms.\n\n3.4 Iterative Algorithm for Computing P\u2217\n\nWe define $P^{(1)} = P_{k,\\alpha} P_{k,\\beta}^{T}$ and\n\n$$P^{(q+1)} = P_{k,\\alpha} P^{(q)} P_{k,\\beta}^{T} + \\Delta. \\tag{10}$$\n\nWe iterate Eq. (10) until convergence, and as we prove in Proposition 1, we obtain\n\n$$P^{*} = \\lim_{q \\to \\infty} P^{(q)} = \\mathrm{vec}^{-1}\\big((I - P)^{-1} \\mathrm{vec}(\\Delta)\\big) \\tag{11}$$\n\nThe iterative process in Eq. (10) is a generalization of the process introduced in [26]. Consequently,\nthe following properties are simple extensions of the properties derived in [26]. However, we state\nthem explicitly, since we combine two different affinity matrices, while [26] considers only a single\nmatrix. In other words, we consider diffusion on TPG of two different graphs, while diffusion on\nTPG of a single graph with itself is considered in [26].\n\nProposition 1\n\n$$\\mathrm{vec}\\Big(\\lim_{q \\to \\infty} P^{(q+1)}\\Big) = \\lim_{q \\to \\infty} \\sum_{e=0}^{q-1} P^{e} \\, \\mathrm{vec}(\\Delta) = (I - P)^{-1} \\mathrm{vec}(\\Delta). \\tag{12}$$\n\nProof: Eq. (10) can be rewritten as\n\n$$\\begin{aligned} P^{(q+1)} &= P_{k,\\alpha} P^{(q)} P_{k,\\beta}^{T} + \\Delta \\\\ &= P_{k,\\alpha} [P_{k,\\alpha} P^{(q-1)} P_{k,\\beta}^{T} + \\Delta] P_{k,\\beta}^{T} + \\Delta \\\\ &= (P_{k,\\alpha})^{2} P^{(q-1)} (P_{k,\\beta}^{T})^{2} + P_{k,\\alpha} \\Delta P_{k,\\beta}^{T} + \\Delta \\\\ &= \\cdots \\\\ &= (P_{k,\\alpha})^{q} P_{k,\\alpha} P_{k,\\beta}^{T} (P_{k,\\beta}^{T})^{q} + (P_{k,\\alpha})^{q-1} \\Delta (P_{k,\\beta}^{T})^{q-1} + \\cdots + \\Delta \\\\ &= (P_{k,\\alpha})^{q} P_{k,\\alpha} P_{k,\\beta}^{T} (P_{k,\\beta}^{T})^{q} + \\sum_{e=0}^{q-1} (P_{k,\\alpha})^{e} \\Delta (P_{k,\\beta}^{T})^{e} \\end{aligned} \\tag{13}$$\n\nLemma 1 $\\lim_{q \\to \\infty} (P_{k,\\alpha})^{q} P_{k,\\alpha} P_{k,\\beta}^{T} (P_{k,\\beta}^{T})^{q} = 0$\n\nProof: It suffices to show that $(P_{k,\\alpha})^{q}$ and $(P_{k,\\beta}^{T})^{q}$ go to 0, when q \u2192 \u221e. 
This is true if and only\nif every eigenvalue of Pk,\u03b1 and Pk,\u03b2 is less than one in absolute value. Since Pk,\u03b1 and Pk,\u03b2 have\nnonnegative entries, this holds if their row sums are all less than one. As described in Sec. 3.1, we have\n$\\sum_{b=1}^{N} (P_{k,\\alpha})_{a,b} < 1$ and $\\sum_{j=1}^{N} (P_{k,\\beta})_{i,j} < 1$.\n\nLemma 1 shows that the first summand in Eq. (13) converges to zero, and consequently we have\n\n$$\\lim_{q \\to \\infty} P^{(q+1)} = \\lim_{q \\to \\infty} \\sum_{e=0}^{q-1} (P_{k,\\alpha})^{e} \\Delta (P_{k,\\beta}^{T})^{e}. \\tag{14}$$\n\nLemma 2 $\\mathrm{vec}\\big((P_{k,\\alpha})^{e} \\Delta (P_{k,\\beta}^{T})^{e}\\big) = P^{e} \\, \\mathrm{vec}(\\Delta)$ for e = 1, 2, . . . .\n\nProof: Our proof is by induction. Suppose $P^{l} \\, \\mathrm{vec}(\\Delta) = \\mathrm{vec}\\big((P_{k,\\alpha})^{l} \\Delta (P_{k,\\beta}^{T})^{l}\\big)$ is true for e = l,\nthen for e = l + 1 we have\n\n$$\\begin{aligned} P^{l+1} \\, \\mathrm{vec}(\\Delta) = P \\, P^{l} \\, \\mathrm{vec}(\\Delta) &= \\mathrm{vec}\\big(P_{k,\\alpha} \\, \\mathrm{vec}^{-1}(P^{l} \\, \\mathrm{vec}(\\Delta)) \\, P_{k,\\beta}^{T}\\big) \\\\ &= \\mathrm{vec}\\big(P_{k,\\alpha} ((P_{k,\\alpha})^{l} \\Delta (P_{k,\\beta}^{T})^{l}) P_{k,\\beta}^{T}\\big) \\\\ &= \\mathrm{vec}\\big((P_{k,\\alpha})^{l+1} \\Delta (P_{k,\\beta}^{T})^{l+1}\\big) \\end{aligned} \\tag{15}$$\n\nand the proof of Lemma 2 is complete.\nBy Lemma 1 and Lemma 2, we obtain that\n\n$$\\mathrm{vec}\\Big(\\sum_{e=0}^{q-1} (P_{k,\\alpha})^{e} \\Delta (P_{k,\\beta}^{T})^{e}\\Big) = \\sum_{e=0}^{q-1} P^{e} \\, \\mathrm{vec}(\\Delta). \\tag{16}$$\n\nThe following useful identity holds for the Kronecker Product [22]:\n\n$$\\mathrm{vec}(P_{k,\\beta} \\Delta P_{k,\\alpha}^{T}) = (P_{k,\\alpha} \\otimes P_{k,\\beta}) \\, \\mathrm{vec}(\\Delta) = P \\, \\mathrm{vec}(\\Delta) \\tag{17}$$\n\nPutting together (14), (15), (16), we obtain\n\n$$\\mathrm{vec}\\Big(\\lim_{q \\to \\infty} P^{(q+1)}\\Big) = \\mathrm{vec}\\Big(\\lim_{q \\to \\infty} \\sum_{e=0}^{q-1} (P_{k,\\alpha})^{e} \\Delta (P_{k,\\beta}^{T})^{e}\\Big) = \\lim_{q \\to \\infty} \\sum_{e=0}^{q-1} P^{e} \\, \\mathrm{vec}(\\Delta) = (I - P)^{-1} \\mathrm{vec}(\\Delta) = \\mathrm{vec}(P^{*}). \\tag{18}$$\n\nThis proves 
Proposition 1.\nWe now show how FD could improve the original similarity measures. Suppose we have two similarity\nmeasures S\u03b1 and S\u03b2. I1 denotes the image patch enclosing the target in frame t\u22121. According\nto S\u03b1, there are many patches in frame t that have nearly equal similarity to I1 with patch In being\nmost similar to I1, while according to S\u03b2, I1 is clearly more similar to Im in frame t. Then\nthe proposed diffusion will enhance the similarity S\u03b2(I1,Im), since it will propagate faster the S\u03b2\nsimilarity of I1 to Im than to the other patches. In contrast, the S\u03b1 similarities will propagate with\nsimilar speed. Consequently, the final joint similarity P\u2217 will have Im as the most similar to I1.\n\nAlgorithm 1: Iterative Fusion with Diffusion Process\nInput: Two matrices Pk,\u03b1, Pk,\u03b2 \u2208 R^{N\u00d7N}\nOutput: Diffusion result P\u2217 \u2208 R^{N\u00d7N}\n\n1 Compute P\u2217 = \u2206.\n2 Compute u\u03b1 = first column of Pk,\u03b1, u\u03b2 = first column of Pk,\u03b2\n3 Compute P\u2217 \u2190 P\u2217 + u\u03b1 u\u03b2^T.\n4 for i = 2, 3, . . . do\n5   Compute u\u03b1 \u2190 Pk,\u03b1 u\u03b1\n6   Compute u\u03b2 \u2190 Pk,\u03b2 u\u03b2\n7   Compute P\u2217 \u2190 P\u2217 + u\u03b1 u\u03b2^T\n8 end\n\n4 FD Algorithm\n\nTo efficiently compute P\u2217, we propose an iterative algorithm that takes advantage of the structure\nof matrix \u2206. Let u\u03b1 be an N \u00d7 1 vector containing the first column of Pk,\u03b1. We write Pk,\u03b1 = [u\u03b1|R]\nand Pk,\u03b1\u2206 = [u\u03b1|0]. It follows then that $P_{k,\\alpha} \\Delta P_{k,\\beta}^{T} = u_\\alpha u_\\beta^{T}$. 
Furthermore, if we denote\n$(P_{k,\\alpha})^{j} \\Delta (P_{k,\\beta}^{T})^{j} = u_{\\alpha,j} u_{\\beta,j}^{T}$, with $u_{\\alpha,j}$ being N \u00d7 1, and $u_{\\beta,j}^{T}$ being 1 \u00d7 N, it follows that\n\n$$(P_{k,\\alpha})^{j+1} \\Delta (P_{k,\\beta}^{T})^{j+1} = P_{k,\\alpha} \\big((P_{k,\\alpha})^{j} \\Delta (P_{k,\\beta}^{T})^{j}\\big) P_{k,\\beta}^{T} = P_{k,\\alpha} u_{\\alpha,j} u_{\\beta,j}^{T} P_{k,\\beta}^{T} = (P_{k,\\alpha} u_{\\alpha,j})(P_{k,\\beta} u_{\\beta,j})^{T} = u_{\\alpha,j+1} u_{\\beta,j+1}^{T}.$$\n\nHence, we replaced one of the two N \u00d7 N matrix products with one matrix product between an\nN \u00d7 N matrix and an N \u00d7 1 vector, and the other with a product of an N \u00d7 1 by a 1 \u00d7 N vector. This\nreduces the complexity of our algorithm from O(n^3) to O(n^2).\nThe final algorithm is shown in Alg. 1.\n\n5 Weight Estimation\n\nThe weight \u03c9\u03b1 of measure S\u03b1 is proportional to how well S\u03b1 is able to distinguish the target I1\nin frame It\u22121 from the background surrounding the target. Let {Bh| h = 1, ..., H} be a set of\nbackground patches surrounding the target I1 in frame It\u22121. The weight of S\u03b1 is defined as\n\n$$\\omega_\\alpha = \\frac{1}{\\frac{1}{H} \\sum_{h=1}^{H} S_\\alpha(I_1, B_h)} \\tag{19}$$\n\nThus, the smaller the values of S\u03b1, the larger is the weight \u03c9\u03b1. The weights of all similarity measures\nare normalized so that $\\sum_{\\alpha=1}^{Q} \\omega_\\alpha = 1$. The weights are computed for every frame in order to\naccommodate appearance changes of the tracked object.\n\n6 Experimental Results\n\nWe validate our tracking algorithm on eight challenging videos from [4] and [17]: Sylvester, Coke\nCan, Tiger1, Cliff Bar, Coupon Book, Surfer, Tiger2, and PETS01D1. 
We compare our method with\nsix state-of-the-art tracking algorithms: the Multiple Instance Learning tracker (MIL)\n[4], the Fragment tracker (Frag) [1], IVT [19], the Online Adaboost tracker (OAB) [10], the SemiBoost tracker\n(Semi) [11], and the Mean-Shift (MS) tracker, as well as a simple weighted linear sum of multiple cues (Linear).\nFor the comparison methods, we run the source code of Semi, Frag, MIL, IVT and OAB supplied by the\nauthors on the testing videos and use the parameters mentioned in their papers directly. For MS, we\nimplement it based on OpenCV. For Linear, we use three kinds of image features to get the affinity\nand then simply calculate the average affinity and use the diffusion process mentioned in [26]. Note\nthat all the parameters in our algorithm were fixed for all the experiments.\nIn our experiments, HOG [9], LBP [18] and Haar-like [4] features are used to represent the image\npatches. Hence each pair of patches is compared with three different similarities based on histograms\nof HOG, LBP, and Haar-like features. For the experimental parameters, we set r = 15 pixels,\nH = 300, k = 12 and the iteration number in Alg. 1 is set to 200.\nTo impartially and comprehensively compare our algorithm with other state-of-the-art trackers, we\nused two kinds of quantitative comparisons: Average Center Location Error (ACLE) and Precision\nScore [4]. The results are shown in Table 1 and Table 2, respectively. Two kinds of curve evaluation\nmethodologies are also used: the Center Location Error (CLE) curve and the Precision Plots curve (see footnote 1). The\nresults are shown in Fig.1 and Fig.2, respectively.\n\nFigure 1: Center Location Error (CLE) versus frame number\n\nTable 1: Average Center Location Error (ACLE measured in pixels). Red color indicates best\nperformance, Blue color indicates second best, Green color indicates the third best\n\nVideo\n\nCoke Can\nCliff Bar\nTiger 1\nTiger2\n\nCoup. 
Book\nSylvester\nSurfer\n\nPETS01D1\n\nMS OAB\n43.7\n25.0\n34.6\n43.8\n39.8\n45.5\n13.2\n47.6\n17.7\n20.0\n20.0\n35.0\n13.4\n17.0\n18.1\n7.1\n\nIVT\n37.3\n47.1\n50.2\n98.5\n32.2\n96.1\n19.0\n241.8\n\nSemi\n40.5\n57.2\n20.9\n39.3\n65.1\n21.0\n9.3\n158.9\n\nFrag1\n69.1\n34.7\n39.7\n38.6\n55.9\n23.0\n140.1\n6.7\n\nFrag2\n69.0\n34.0\n26.7\n38.8\n56.1\n12.2\n139.8\n7.2\n\nFrag3 MIL Linear\n34.1\n16.8\n15.0\n44.8\n23.8\n31.1\n6.5\n51.9\n13.6\n67.0\n10.1\n10.5\n6.5\n138.6\n9.5\n245.4\n\n31.9\n14.2\n7.6\n20.6\n19.8\n11.4\n7.7\n11.7\n\nour\n15.4\n6.1\n6.9\n5.7\n6.5\n9.3\n5.5\n6.0\n\nTable 2: Precision Score (precision at the \ufb01xed threshold of 15). Red color indicates best perfor-\nmance, Blue color indicates second best, Green color indicates the third best.\n\nVideo\n\nCoke Can\nCliff Bar\nTiger 1\nTiger 2\n\nCoupon Book\n\nSylvester\nSurfer\n\nPETS01D1\n\nMS OAB IVT\n0.11\n0.15\n0.19\n0.08\n0.03\n0.05\n0.01\n0.06\n0.21\n0.16\n0.46\n0.06\n0.40\n0.59\n0.38\n0.01\n\n0.21\n0.21\n0.17\n0.65\n0.18\n0.30\n0.61\n1.00\n\nSemi\n0.18\n0.34\n0.52\n0.44\n0.41\n0.53\n0.89\n0.29\n\nFrag1\n0.09\n0.20\n0.21\n0.09\n0.39\n0.72\n0.19\n0.99\n\nFrag2\n0.09\n0.23\n0.38\n0.09\n0.39\n0.78\n0.21\n0.97\n\nFrag3 MIL Linear\n0.17\n0.36\n0.52\n0.12\n0.54\n0.38\n0.89\n0.12\n0.53\n0.39\n0.81\n0.86\n1.00\n0.23\n0.95\n0.02\n\n0.24\n0.79\n0.90\n0.66\n0.23\n0.76\n0.93\n0.80\n\nour\n0.46\n0.95\n0.91\n0.95\n1.00\n0.90\n1.00\n1.00\n\nComparison to matching based methods: MS, IVT, Frag and Linear are all matching based\ntracking algorithms. 
In MS, the Bhattacharyya coefficient is used to measure the distance between\nhistogram distributions; for Frag, we test it under three different measurement strategies: the\nKolmogorov-Smirnov statistic, EMD, and Chi-Square distance, represented as Frag1, Frag2, Frag3\nin Table 1 and Table 2, respectively.\n\n1 More details about the meaning of Precision Plots can be found in [4]\n\n[Figure 1 plots: panels Coke Can, Cliff Bar, Coupon Book, Sylvester, Surfer, Tiger1, Tiger2, PETS01D1; x-axis: Frame #; y-axis: Center Location Error (pixel); curves: MS, Frag(KS), Frag(EMD), Frag(Chi), IVT, Linear, Our]\n\nFigure 2: Precision Plots. The threshold is set to 15 in our experiments.\n\nFor Linear Combination, the average similarity is used and\nthe diffusion process in [26] is used to improve the similarity measure. Our FD approach clearly\noutperforms the other approaches, as shown in Table 1 and Table 2. Our tracking results achieve the\nbest performance in all the testing videos, especially for the Precision Score shown in Table 2. Even\nthough we set the threshold to 15, which is more challenging for all the trackers, we still get three\n1.00 scores. 
In some videos like Sylvester and PETS01D1, Frag achieves comparable results with\nour method, but it performs poorly in other videos, which means that a specific distance measure can only\nwork in some special cases, whereas our fusion framework is robust for all the challenges that appear in\nthe videos. Our method is always better than Linear Combination, which means that the fusion with\ndiffusion can really improve the tracking performance. The stability of our method can be clearly\nseen in the plots of location error as the function of frame number in Fig.1. Our tracking results\nare always stable, which means that we do not lose the target in the whole tracking process. This is\nalso reflected in the fact that our Precision is always better than all the other methods under different\nthresholds as shown in Fig.2.\nComparison to classification based methods: MIL and OAB are both classification based tracking\nalgorithms. For OAB, on-line Adaboost is used to train the classifier for the foreground and background\nclassification. MIL combines multiple instance learning with on-line Adaboost. Haar-like\nfeatures are used in both methods. Again our method outperforms those two methods as can be seen\nin Table 1 and Table 2.\nComparison to semi-supervised learning based methods: SemiBoost combines semi-supervised\nlearning with on-line Adaboost. Our method is also similar to semi-supervised learning, since we build\nthe graph model on consecutive frames, which means that both our method and SemiBoost use\nthe information from the forthcoming frame. Our method is always better than SemiBoost as shown\nin Table 1 and Table 2.\n\n7 Conclusions\nIn this paper, a novel Fusion with Diffusion process is proposed for robust visual tracking. Pairs\nof similarity measures are fused into a single similarity measure with a diffusion process on the\ntensor product of two graphs determined by the two similarity measures. 
The proposed method has\ntime complexity of O(n^2), which makes it suitable for real time tracking. It is evaluated on several\nchallenging videos, and it significantly outperforms a large number of state-of-the-art tracking\nalgorithms.\n\nAcknowledgments\nWe would like to thank all the authors for releasing their source code and testing videos, since they\nmade our experimental evaluation possible. This work was supported by NSF Grants IIS-0812118,\nBCS-0924164, OIA-1027897, and by the National Natural Science Foundation of China (NSFC)\nGrants 60903096, 61222308 and 61173120.\n\n[Figure 2 plots: panels Cliff Bar, Coke Can, Coupon Book, Sylvester, Surfer, PETS01D1, Tiger1, Tiger2; x-axis: Threshold; y-axis: Precision; curves: MS, Frag(KS), Frag(EMD), Frag(Chi), IVT, Linear, Our]\n\nReferences\n[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragment-based tracking using the integral histogram. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 798\u2013805, 2006.\n\n[2] S. Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1064\u20131072, 2004.\n\n[3] S. Avidan. 
Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):261\u2013271, 2007.\n\n[4] B. Babenko, M. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1619\u20131632, 2011.\n\n[5] X. Bai, B. Wang, C. Yao, W. Liu, and Z. Tu. Co-transduction for shape retrieval. IEEE Transactions on Image Processing, 21(5):2747\u20132757, 2012.\n\n[6] X. Bai, X. Yang, L. J. Latecki, W. Liu, and Z. Tu. Learning context sensitive shape similarity by graph transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):861\u2013874, 2010.\n\n[7] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(Special Issue on Clustering):209\u2013239, 2004.\n\n[8] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564\u2013575, 2003.\n\n[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 886\u2013893, 2005.\n\n[10] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In British Machine Vision Conference (BMVC), pages 47\u201356, 2006.\n\n[11] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. In European Conference on Computer Vision (ECCV), pages 234\u2013247, 2008.\n\n[12] N. Jiang, W. Liu, and Y. Wu. Learning adaptive metric for robust visual tracking. IEEE Transactions on Image Processing, 20(8):2288\u20132300, 2011.\n\n[13] J. Kwon and K. M. Lee. Visual tracking decomposition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010.\n\n[14] J. Lim, D. Ross, R.-S. Lin, and M.-H. Yang. Incremental learning for visual tracking. 
In Advances in Neural Information Processing Systems (NIPS), 2005.\n\n[15] R. Liu, J. Cheng, and H. Lu. A robust boosting tracker with minimum error bound in a co-training framework. In IEEE International Conference on Computer Vision (ICCV), 2009.\n\n[16] X. Mei and H. Ling. Robust visual tracking and vehicle classification via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):2259\u20132272, 2011.\n\n[17] X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai. Minimum error bounded efficient l1 tracker with occlusion detection. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.\n\n[18] T. Ojala, M. Pietik\u00e4inen, and T. M\u00e4enp\u00e4\u00e4. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971\u2013987, 2002.\n\n[19] D. Ross, J. Kim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1):125\u2013141, 2008.\n\n[20] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof. Prost: Parallel robust online simple tracking. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010.\n\n[21] K. Sinha and M. Belkin. Semi-supervised learning using sparse eigenfunction bases. In Advances in Neural Information Processing Systems (NIPS), 2009.\n\n[22] S. Vishwanathan, N. Schraudolph, R. Kondor, and K. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(4):1201\u20131242, 2010.\n\n[23] B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, and Z. Tu. Unsupervised metric fusion by cross diffusion. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2012.\n\n[24] W. Wang and Z. Zhou. A new analysis of co-training. In International Conference on Machine Learning (ICML), 2010.\n\n[25] Y. Wu and J. Fan. Contextual flow. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.\n\n[26] X. Yang and L. J. Latecki. Affinity learning on a tensor product graph with applications to shape and image retrieval. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011.\n\n[27] W. Zhong, H. Lu, and M.-H. Yang. Robust object tracking via sparsity-based collaborative model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.\n\n[28] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Sch\u00f6lkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems (NIPS), 2004.\n\n[29] X. Zhu. Semi-supervised learning literature survey. In Technical Report 1530, Department of Computer Sciences, University of Wisconsin, Madison, 2005.\n", "award": [], "sourceid": 1351, "authors": [{"given_name": "Yu", "family_name": "Zhou", "institution": null}, {"given_name": "Xiang", "family_name": "Bai", "institution": null}, {"given_name": "Wenyu", "family_name": "Liu", "institution": null}, {"given_name": "Longin", "family_name": "Latecki", "institution": null}]}