{"title": "Recognizing Activities by Attribute Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 1106, "page_last": 1114, "abstract": "In this work, we consider the problem of modeling the dynamic structure of human activities in  the attributes space. A video sequence is first represented in a semantic feature space, where each feature encodes the probability of  occurrence of an activity attribute at a given time. A generative model,  denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this  space. The BDS is a non-linear dynamic system, which extends both the binary  principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state  process. In this way, it integrates the representation power of semantic  modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS  parameters, inspired by a popular LDS  learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes  the Binet-Cauchy kernel for LDS, is then introduced and used to design  activity classifiers. The proposed method is shown to outperform similar classifiers  derived from the kernel dynamic system (KDS) and state-of-the-art approaches for  dynamics-based or attribute-based action recognition.", "full_text": "Recognizing Activities by Attribute Dynamics\n\nWeixin Li\n\nNuno Vasconcelos\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego\nLa Jolla, CA 92093, United States\n\n{wel017, nvasconcelos}@ucsd.edu\n\nAbstract\n\nIn this work, we consider the problem of modeling the dynamic structure of hu-\nman activities in the attributes space. 
A video sequence is \ufb01rst represented in a\nsemantic feature space, where each feature encodes the probability of occurrence\nof an activity attribute at a given time. A generative model, denoted the binary\ndynamic system (BDS), is proposed to learn both the distribution and dynamics\nof different activities in this space. The BDS is a non-linear dynamic system,\nwhich extends both the binary principal component analysis (PCA) and classical\nlinear dynamic systems (LDS), by combining binary observation variables with\na hidden Gauss-Markov state process. In this way, it integrates the representa-\ntion power of semantic modeling with the ability of dynamic systems to capture\nthe temporal structure of time-varying processes. An algorithm for learning BDS\nparameters, inspired by a popular LDS learning method from dynamic textures,\nis proposed. A similarity measure between BDSs, which generalizes the Binet-\nCauchy kernel for LDS, is then introduced and used to design activity classi\ufb01ers.\nThe proposed method is shown to outperform similar classi\ufb01ers derived from the\nkernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based\nor attribute-based action recognition.\n\n1\n\nIntroduction\n\nHuman activity understanding has been a research topic of substantial interest in computer vision [1].\nInspired by the success of the popular bag-of-features (BoF) representation on image classi\ufb01cation\nproblems, it is frequently based on the characterization of video as a collection of orderless spa-\ntiotemporal features [2, 3]. Recently, there have been attempts to extend this representation along\ntwo dimensions that we explore in this work. The \ufb01rst is to introduce richer models for the temporal\nstructure, also known as dynamics, of human actions [4, 5, 6, 7]. This aims to exploit the fact that\nactions are usually de\ufb01ned as sequences of poses, gestures, or other events over time. 
While desir-\nable, modeling action dynamics can be a complex proposition, and this can sometimes compromise\nthe robustness of recognition algorithms, or sacri\ufb01ce their generality, e.g., it is not uncommon for\ndynamic models to require features speci\ufb01c to certain datasets or action classes [5, 6], or non-trivial\nforms of pre-processing, such as tracking [8], manual annotation [7], etc. The second dimension,\nagain inspired by recent developments in image classi\ufb01cation [9, 10], is to represent actions in\nterms of intermediate-level semantic concepts, or attributes [11, 12]. This introduces a layer of\nabstraction that improves the generalization of the representation, enables modeling of contextual\nrelationships [13], and simpli\ufb01es knowledge transfer across activity classes [11].\nIn this work, we propose a representation that combines the bene\ufb01ts of these two types of extensions.\nThis consists of modeling the dynamics of human activities in the attributes space. The idea is to\nexploit the fact that an activity is usually de\ufb01ned as a sequence of semantic events. For example, the\nactivity \u201cstoring an object in a box\u201d is de\ufb01ned as the sequence of the action attributes \u201cremove (hand\nfrom box)\u201d, \u201cgrab (object)\u201d, \u201cinsert (hand in box)\u201d, and \u201cdrop (object)\u201d. The representation of\nthe action as a sequence of these attributes makes the characterization of the \u201cstoring object in\nbox\u201d activity more robust (to confounding factors such as diversity of grabbing styles, hand motion\nspeeds, or camera motions) than dynamic representations based on low-level features. It is also\nmore discriminant than semantic representations that ignore dynamics, i.e., that simply record the\noccurrence (or frequency) of the action attributes \u201cremove\u201d, \u201cgrab\u201d, \u201cinsert\u201d, and \u201cdrop\u201d. 
In the\nabsence of information about the sequence in which these attributes occur, the \u201cstore object in box\u201d\nactivity cannot be distinguished from the \u201cretrieve object from box\u201d activity, de\ufb01ned as the sequence\n\u201cinsert (hand in box)\u201d, \u201cgrab (object)\u201d, \u201cremove (hand from box)\u201d, and \u201cdrop (object)\u201d. In summary,\nthe modeling of attribute dynamics is 1) more robust and \ufb02exible than the modeling of visual (low-\nlevel) dynamics, and 2) more discriminant than the modeling of attribute frequencies.\nIn this work, we address the problem of modeling attribute dynamics for activities. As is usual in\nsemantics-based recognition [11], we start by representing video in a semantic feature space, where\neach feature encodes the probability of occurrence of an action attribute in the video, at a given\ntime. We then propose a generative model, denoted the binary dynamic system (BDS), to learn both\nthe distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic\nsystem, which combines binary observation variables with a hidden Gauss-Markov state process.\nIt can be interpreted as either 1) a generalization of binary principal component analysis (binary\nPCA) [14], which accounts for data dynamics, or 2) an extension of the classical linear dynamic\nsystem (LDS), which operates on a binary observation space. For activity recognition, the BDS has\nthe appeal of accounting for the two distinguishing properties of the semantic activity representation:\n1) that semantic vectors de\ufb01ne probability distributions over a space of binary attributes; and 2) that\nthese distributions evolve according to smooth trajectories that re\ufb02ect the dynamics of the underlying\nactivity. Its advantages over previous representations are illustrated by the introduction of BDS-\nbased activity classi\ufb01ers. 
For this, we start by proposing an ef\ufb01cient BDS learning algorithm, which\ncombines binary PCA and a least squares problem, inspired by the learning procedure in dynamic\ntextures [15]. We then derive a similarity measure between BDSs, which generalizes the Binet-\nCauchy kernel from the LDS literature [16]. This is \ufb01nally used to design activity classi\ufb01ers, which\nare shown to outperform similar classi\ufb01ers derived from the kernel dynamic systems (KDS) [6], and\nstate-of-the-art approaches for dynamics-based [4] and attribute-based [11] action recognition.\n\n2 Prior Work\nOne of the most popular representations for activity recognition is the BoF, which reduces video to\na collection of orderless spatiotemporal descriptors [2, 3]. While robust, the BoF ignores the tem-\nporal structure of activities, and has limited power for \ufb01ne-grained activity discrimination. A number\nof approaches have been proposed to characterize this structure. One possibility is to represent ac-\ntions in terms of limb or torso motions, spatiotemporal shape models, or motion templates [17, 18].\nSince they require detection, segmentation, tracking, or 3D structure recovery of body parts, these\nrepresentations can be fragile. A robust alternative is to model the temporal structure of the BoF.\nThis can be achieved with generalizations of popular still image recognition methods. For example,\nLaptev et al. extend pyramid matching to video, using a 3D binning scheme that roughly character-\nizes the spatio-temporal structure of video [3]. Niebles et al. employ a latent SVM that augments\nthe BoF with temporal context, which they show to be critical for understanding realistic motion [4].\nAll these approaches have relatively coarse modeling of dynamics. More elaborate models are usu-\nally based on generative representations. For example, Laxton et al. 
model a combination of object\ncontexts and action sequences with a dynamic Bayesian network [5], while Gaidon et al. reduce\neach activity to three atomic actions and model their temporal distributions [7]. These methods\nrely on action-class speci\ufb01c features and require detailed manual supervision. Alternatively, sev-\neral researchers have proposed to model BoF dynamics with LDSs. For example, Kellokumpu et al.\ncombine dynamic textures [15] and local binary patterns [19], Li et al. perform a discriminant canon-\nical correlation analysis on the space of action dynamics [8], and Chaudhry et al. map frame-wise\nmotion histograms to a reproducing kernel Hilbert space (RKHS), where they learn a KDS [6].\nRecent research in image recognition has shown that various limitations of the BoF can be overcome\nwith representations of higher semantic level [10]. The features that underlie these representations\nare con\ufb01dence scores for the appearance of pre-de\ufb01ned visual concepts in images. These concepts\ncan be object attributes [9], object classes [20, 21], contextual classes [13], or generic visual con-\ncepts [22]. Lately, semantic attributes have also been used for action recognition [11], demonstrating\nthe bene\ufb01ts of a mid-level semantic characterization for the analysis of complex human activities.\n\nFigure 1: Left: key frames of activities \u201churdle race\u201d (top) and \u201clong jump\u201d (bottom); Right: attribute transi-\ntion probabilities of the two activities (\u201churdle race\u201d / \u201clong jump\u201d) for attributes \u201crun\u201d, \u201cjump\u201d, and \u201cland\u201d.\n\nThe work also suggests that, for action categorization, supervised attribute learning is far more useful\nthan unsupervised learning, resembling a similar observation from image recognition [20]. 
How-\never, all of these representations are BoF-like, in the sense that they represent actions as orderless\nfeature collections, reducing an entire video sequence to an attribute vector. For this reason, we\ndenote them holistic attribute representations.\nThe temporal evolution of semantic concepts, throughout a video sequence, has not yet been ex-\nploited as a cue for action understanding. There has, however, been some progress towards this\ntype of modeling in the text analysis literature, where temporal extensions of latent Dirichlet allo-\ncation (LDA) have been proposed. Two representatives are the dynamic topic model (DTM) [23]\nand the topic over time (TOT) model [24]. Although modeling topic dynamics, these models are not\nnecessarily applicable to semantic action recognition. First, like the underlying LDA, they are un-\nsupervised models, and thus likely to underperform in recognition tasks [11, 10]. Second, the joint\ngoal of topic discovery and modeling topic dynamics requires a complex graphical model. This is\nat odds with tractability, which is usually achieved by sacri\ufb01cing the expressiveness of the temporal\nmodel component.\n\n3 Modeling the Dynamics of Activity Attributes\n\nIn this section, we introduce a new model, the binary dynamic system, for joint representation of the\ndistribution and dynamics of activities in action attribute space.\n\n3.1 Semantic Representation\nSemantic representations characterize video as a collection of descriptors with explicit seman-\ntics [10, 11]. They are obtained by de\ufb01ning a set of semantic concepts (or attributes, scene classes,\netc), and learning a classi\ufb01er to detect each of those concepts. Given a video v \u2208 X to analyze, each\nclassi\ufb01er produces a con\ufb01dence score for the presence of the associated concept. 
The ensemble of\nclassi\ufb01ers maps the video to a semantic space S, according to \u03c0 : X \u2192 S = [0, 1]^K, \u03c0(v) =\n(\u03c0_1(v), \u00b7\u00b7\u00b7, \u03c0_K(v))^T, where \u03c0_i(v) is the con\ufb01dence score for the presence of the i-th concept.\nIn this work, the classi\ufb01cation score is the posterior probability of a concept c given video v,\ni.e., \u03c0_c(v) = p(c|v) under a certain video representation, e.g., the popular BoF histogram of spatio-\ntemporal descriptors. As the video sequence v progresses with time t, the semantic encoding de\ufb01nes\na trajectory {\u03c0_t(v)} \u2282 S. The bene\ufb01ts of semantic representations for recognition, namely a higher\nlevel of abstraction (which leads to better generalization than appearance-based representations),\nsubstantial robustness to the performance of the visual classi\ufb01ers \u03c0_i(v), and intrinsic ability to ac-\ncount for contextual relationships between concepts, have been previously documented in the litera-\nture [13]. No attention has, however, been devoted to modeling the dynamics of semantic encodings\nof video. Figure 1 motivates the importance of such modeling for action recognition, by considering\ntwo activity categories (\u201clong jump\u201d and \u201churdle race\u201d), which involve the same attributes, with\nroughly the same probabilities, but span very different trajectories in S. Modeling these dynamics\ncan substantially enhance the ability of a classi\ufb01er to discriminate between complex activities.\n\n3.2 Binary PCA\nThe proposed representation is a generalization of binary PCA [14], a dimensionality reduction\ntechnique for binary data, belonging to the generalized exponential family PCA [25]. It \ufb01ts a linear\nmodel to binary observations, by embedding the natural parameters of Bernoulli distributions in a\nlow-dimensional subspace. Let Y denote a K \u00d7 \u03c4 binary matrix (Y_{kt} \u2208 {0, 1}, e.g., the indicator of\noccurrence of attribute k at time t) where each column is a vector of K binary observations sampled\nfrom a multivariate Bernoulli distribution\n\nY_{kt} \u223c B(y_{kt}; \u03c0_{kt}) = \u03c0_{kt}^{y_{kt}} (1 \u2212 \u03c0_{kt})^{1 \u2212 y_{kt}} = \u03c3(\u03b8_{kt})^{y_{kt}} \u03c3(\u2212\u03b8_{kt})^{1 \u2212 y_{kt}}, y_{kt} \u2208 {0, 1}. (1)\n\nThe log-odds \u03b8 = log(\u03c0 / (1 \u2212 \u03c0)) is the natural parameter of the Bernoulli distribution, and \u03c3(\u03b8) =\n(1 + e^{\u2212\u03b8})^{\u22121} is the logistic function. Binary PCA \ufb01nds an L-dimensional (L \u226a K) embedding of the\nnatural parameters, by maximizing the log-likelihood of the binary matrix Y\n\nL = log P(Y; \u0398) = \u2211_{k,t} [ Y_{kt} log \u03c3(\u0398_{kt}) + (1 \u2212 Y_{kt}) log \u03c3(\u2212\u0398_{kt}) ] (2)\n\nunder the constraint\n\n\u0398 = CX + u1^T, (3)\n\nwhere C \u2208 R^{K\u00d7L}, X \u2208 R^{L\u00d7\u03c4}, u \u2208 R^K and 1 \u2208 R^\u03c4 is the vector of all ones. Each column of C\nis a basis vector of a latent subspace and the t-th column of X contains the coordinates of the t-th\nbinary vector in this basis (up to a translation by u).\nBinary PCA is not directly applicable to attribute-based recognition, where the goal is to \ufb01t the\nvectors of con\ufb01dence scores {\u03c0_t} produced by a set of K attribute classi\ufb01ers (and not a sample of\nbinary attribute vectors per se). To overcome this problem, we maximize the expected log-likelihood\nof the data Y (which is a lower bound to the log expected likelihood of the data Y, by Jensen\u2019s\ninequality). 
Since E[y_t] = \u03c0_t, it follows from (2) that\n\nE_Y[L] = \u2211_{k,t} [ \u03c0_{kt} log \u03c3(\u0398_{kt}) + (1 \u2212 \u03c0_{kt}) log \u03c3(\u2212\u0398_{kt}) ]. (4)\n\nThe proposed extension of binary PCA consists of maximizing this expected log-likelihood under\nthe constraint of (3). It can be shown that, in the absence of the constraint, the maximum occurs\nwhen \u03c3(\u0398_{kt}) = \u03c0_{kt}, \u2200k, t. As in PCA, (3) forces \u03c3(\u0398_{kt}) to lie on a subspace of S, i.e.,\n\n\u03c3(\u0398_{kt}) = \u02c6\u03c0_{kt} \u2248 \u03c0_{kt}. (5)\n\nThe difference between the expected log-likelihood of the true scores {\u03c0_t} and the binary PCA\nscores {\u03c3(\u03b8_t) = \u03c3(Cx_t + u)} (\u03c3(\u03b8) \u2261 [\u03c3(\u03b8_1), \u00b7\u00b7\u00b7, \u03c3(\u03b8_K)]^T) is\n\nE[\u2206L({\u03c0_t}; {\u03c3(\u03b8_t)})] = E_Y[log P(Y; {\u03c0_t})] \u2212 E_Y[log P(Y; {\u03c3(\u03b8_t)})] (6)\n= \u2211_{k,t} [ \u03c0_{kt} log(\u03c0_{kt} / \u03c3(\u0398_{kt})) + (1 \u2212 \u03c0_{kt}) log((1 \u2212 \u03c0_{kt}) / \u03c3(\u2212\u0398_{kt})) ] (7)\n= \u2211_t KL[B(y; \u03c0_t) || B(y; \u03c3(\u03b8_t))], (8)\n\nwhere KL(B(y; \u03c0) || B(y; \u03c0\u2032)) is the Kullback-Leibler (KL) divergence between two multivariate\nBernoulli distributions of parameters \u03c0 and \u03c0\u2032. By maximizing the expected log-likelihood (4), the\noptimal projection {\u03b8_t^\u2217} of the attribute score vectors {\u03c0_t} on the subspace of (3) also minimizes the\nKL divergence of (8). Hence, for the optimal natural parameters {\u03b8_t^\u2217}, the approximation of (5) is the\nbest in the sense of KL divergence, the natural similarity measure between probability distributions.\n\n3.3 Binary Dynamic Systems\nA discrete time linear dynamic system (LDS) is de\ufb01ned by\n\nx_{t+1} = Ax_t + v_t,\ny_t = Cx_t + w_t + u, (9)\n\nwhere x_t \u2208 R^L and y_t \u2208 R^K (of mean u) are the hidden state and observation variable at\ntime t, respectively; A \u2208 R^{L\u00d7L} is the state transition matrix that encodes the underlying dynam-\nics; C \u2208 R^{K\u00d7L} the observation matrix that linearly maps the state to the observation space; and\nx_1 = \u00b5_0 + v_0 \u223c N(\u00b5_0, S_0) an initial condition. Both state and observations are subject to addi-\ntive Gaussian noise processes v_t \u223c N(0, Q) and w_t \u223c N(0, R). Since the noise is Gaussian and\nL < K, the matrix C can be interpreted as a PCA basis for the observation space (L eigenvectors\nof the observation covariance). The state vector x_t then encodes the trajectory of the PCA coef\ufb01-\ncients (projection on this basis) of the observed data over time. This interpretation is, in fact, at the\ncore of the popular dynamic texture (DT) [15] representation for video. 
While the LDS parameters\ncan be learned by maximum likelihood, using an expectation-maximization (EM) algorithm [26],\nthe DT decouples the learning of observation and state variables. Observation parameters are \ufb01rst\nlearned by PCA, and state parameters are then learned with a least squares procedure. This simple\napproximate learning algorithm tends to perform very well, and is widely used in computer vision.\n\nAlgorithm 1: Learning a binary dynamic system\nInput: a sequence of attribute score vectors {\u03c0_t}_{t=1}^\u03c4, state space dimension n.\nBinary PCA: {C, X, u} = B-PCA({\u03c0_t}_{t=1}^\u03c4, n) using the method of [14].\nEstimate state parameters (X_{t1}^{t2} \u2261 [x_{t1}, \u00b7\u00b7\u00b7, x_{t2}]):\n  A = X_2^\u03c4 (X_1^{\u03c4\u22121})^\u2020; V = X_2^\u03c4 \u2212 A X_1^{\u03c4\u22121}; Q = (1/(\u03c4\u22121)) V V^T;\n  \u00b5_0 = (1/\u03c4) \u2211_{t=1}^\u03c4 x_t; S_0 = (1/(\u03c4\u22121)) \u2211_{t=1}^\u03c4 (x_t \u2212 \u00b5_0)(x_t \u2212 \u00b5_0)^T.\nOutput: {A, C, Q, u, \u00b5_0, S_0}\n\nThe proposed binary dynamic system (BDS) is de\ufb01ned as\n\nx_{t+1} = Ax_t + v_t,\ny_t \u223c B(y; \u03c3(Cx_t + u)), (10)\n\nwhere x_t \u2208 R^L and u \u2208 R^K are the hidden state variable and observation bias, respectively; A \u2208\nR^{L\u00d7L} is the state transition matrix; and C \u2208 R^{K\u00d7L} the observation matrix. The initial condition is\ngiven by x_1 = \u00b5_0 + v_0 \u223c N(\u00b5_0, S_0); and the state noise process is v_t \u223c N(0, Q). Like the LDS\nof (9), the BDS can be interpreted as combining a (now binary) PCA observation component with\na Gauss-Markov process for the state sequence. 
As in binary PCA, for attribute-based recognition\nthe binary observations yt are replaced by the attribute scores \u03c0t, their log-likelihood under (10)\nby the expected log-likelihood, and the optimal solution minimizes the approximation error of (5) for\nthe most natural de\ufb01nition of similarity (KL divergence) between probability distributions. This is\nconceptually equivalent to the behavior of the canonical LDS of (9), which determines the subspace\nthat best approximates the observations in the Euclidean sense, the natural similarity measure for\nGaussian data. Note that other extensions of the LDS, e.g., kernel dynamic systems (KDS) that rely\non a non-linear kernel PCA (KPCA) [27] of the observation space but still assume a Euclidean\nmeasure (Gaussian noise) [28, 6], do not share this property. We will see, in the experimental\nsection, that the BDS is a better model of attribute dynamics.\n\n3.4 Learning\nSince the Gaussian state distribution of an LDS is a conjugate prior for the (Gaussian) conditional\ndistribution of its observations given the state, maximum-likelihood estimates of LDS parameters\nare tractable. The LDS parameters \u2126LDS = {A, C, Q, R, \u00b50, S0, u} of (9) can thus be estimated\nwith an EM algorithm [26]. For the BDS, where the state is Gaussian but the observations are not,\nthe expectation step is intractable. Hence, approximate inference is required to learn the parameters\n\u2126BDS = {A, C, Q, \u00b50, S0, u} of (10). In this work, we resort to the approximate DT learning\nprocedure, where observation and state components are learned separately [15]. The binary PCA\nbasis is learned \ufb01rst, by maximizing the expected log-likelihood of (4) subject to the constraint\nof (3). Since the Bernoulli distribution is a member of the exponential family, (4) is concave in \u0398, but\nnot in C, X and u jointly. 
We rely on a procedure introduced by [14], which iterates between the\noptimization with respect to one of the variables C, X and u, with the remaining two held constant.\nEach iteration is a convex sub-problem that can be solved ef\ufb01ciently with a \ufb01xed-point auxiliary\nfunction (see [14] for details). Once the latent embedding C^\u2217, X^\u2217 and u^\u2217 of the attribute sequence\nin the optimal subspace is recovered, the remaining parameters are estimated by solving a least-\nsquares problem for A and Q, and using standard maximum likelihood estimates for the Gaussian\nparameters of the initial condition (\u00b5_0 and S_0) [15]. The procedure is summarized in Algorithm 1.\n\n4 Measuring Distances between BDSs\n\nThe design of classi\ufb01ers that account for attribute dynamics requires the ability to quantify similarity\nbetween BDSs. In this section, we derive the BDS counterpart to the popular Binet-Cauchy ker-\nnel (BCK) for the LDS, which evaluates the similarity of the output sequences of two LDSs. Given\nLDSs \u2126_a and \u2126_b driven by identical noise processes v_t and w_t with observation sequences y^{(a)}\nand y^{(b)}, [16] propose a family of BCKs\n\nK_BC(\u2126_a, \u2126_b) = E_{v,w}[ \u2211_{t=0}^\u221e e^{\u2212\u03bbt} (y_t^{(a)})^T W y_t^{(b)} ], (11)\n\nwhere W is a positive semi-de\ufb01nite weight matrix and \u03bb \u2265 0 a temporal discounting factor. To\nextend (11) to BDSs \u2126_a and \u2126_b, we note that (y_t^{(a)})^T W y_t^{(b)} is the inner product of a Euclidean\noutput space of metric d^2(y_t^{(a)}, y_t^{(b)}) = (y_t^{(a)} \u2212 y_t^{(b)})^T W (y_t^{(a)} \u2212 y_t^{(b)}). For BDSs, whose obser-\nvations y_t are Bernoulli distributed with parameters {\u03c3(\u03b8_t^{(a)})}, for \u2126_a, and {\u03c3(\u03b8_t^{(b)})}, for \u2126_b, this\ndistance measure is naturally replaced by the KL divergence between Bernoulli distributions\n\nD_BC(\u2126_a, \u2126_b) = E_v[ \u2211_{t=0}^\u221e e^{\u2212\u03bbt} ( KL(B(\u03c3(\u03b8_t^{(a)})) || B(\u03c3(\u03b8_t^{(b)}))) + KL(B(\u03c3(\u03b8_t^{(b)})) || B(\u03c3(\u03b8_t^{(a)}))) ) ]\n= E_v[ \u2211_{t=0}^\u221e e^{\u2212\u03bbt} (\u03c3(\u03b8_t^{(a)}) \u2212 \u03c3(\u03b8_t^{(b)}))^T (\u03b8_t^{(a)} \u2212 \u03b8_t^{(b)}) ], (12)\n\nwhere \u03b8_t = Cx_t + u. The distance term at time t can be rewritten as\n\n(\u03c3(\u03b8_t^{(a)}) \u2212 \u03c3(\u03b8_t^{(b)}))^T (\u03b8_t^{(a)} \u2212 \u03b8_t^{(b)}) = (\u03b8_t^{(a)} \u2212 \u03b8_t^{(b)})^T \u02c6W_t (\u03b8_t^{(a)} \u2212 \u03b8_t^{(b)}), (13)\n\nwith \u02c6W_t a diagonal matrix whose k-th diagonal element is \u02c6W_{t,k} = (\u03c3(\u0398_{t,k}^{(a)}) \u2212 \u03c3(\u0398_{t,k}^{(b)})) / (\u0398_{t,k}^{(a)} \u2212\n\u0398_{t,k}^{(b)}) = \u03c3\u2032(\u02c6\u0398_{t,k}^{(a,b)}) (where, by the mean value theorem, \u02c6\u0398_{t,k}^{(a,b)} is some real value between \u0398_{t,k}^{(a)} and\n\u0398_{t,k}^{(b)}). This reduces (13) to a form similar to (11), although with a time varying weight matrix \u02c6W_t.\nIt is unclear whether (12) can be computed in closed-form. We currently rely on the approximation\n\nD_BC(\u2126_a, \u2126_b) \u2248 \u2211_{t=0}^\u221e e^{\u2212\u03bbt} (\u03c3(\u00af\u03b8_t^{(a)}) \u2212 \u03c3(\u00af\u03b8_t^{(b)}))^T (\u00af\u03b8_t^{(a)} \u2212 \u00af\u03b8_t^{(b)}), where \u00af\u03b8 is the mean of \u03b8.\n\n5 Experiments\n\nSeveral experiments were conducted to evaluate the BDS as a model of activity attribute dynam-\nics. In all cases, the BoF was used as the low-level video representation, interest points were detected\nwith [2], and HoG/HoF descriptors [3] computed at their locations. A codebook of 3000 visual\nwords was learned via k-means, from the entire training set, and a binary SVM with histogram\nintersection kernel (HIK) and probability outputs [29] trained to detect each attribute, using the same\nattribute de\ufb01nitions as [11]. The probability for attribute k at time t was used as attribute score\n\u03c0_{kt}, which was computed over a window of 20 frames, sliding across a video.\n\n5.1 Weizmann Activities\nTo obtain some intuition on the performance of the different algorithms considered, we \ufb01rst used com-\nplex activity sequences synthesized from the Weizmann dataset [17]. This contains 10 atomic action\nclasses (e.g., skipping, walking) annotated with respect to 30 lower-level attributes (e.g., \u201cone-arm-\nmotion\u201d), and performed by 9 people. We created activity sequences by concatenating Weizmann\nactions. A sequence of degree n (n = 4, 5, 6) is composed of n atomic actions, performed by the\nsame person. The row of images at the top of Figure 2 presents an example of an activity sequence of\ndegree 5. 
The images shown at the top of the \ufb01gure are keyframes from the atomic actions (\u201cwalk\u201d,\n\u201cpjump\u201d, \u201cwave1\u201d, \u201cwave2\u201d, \u201cwave2\u201d) that compose this activity sequence. The black curve (la-\nbeled \u201cSem. Seq\u201d) in the plot at the bottom of the \ufb01gure shows the score of the \u201ctwo-arms-motion\u201d\nattribute, as a function of time. 40 activity categories were de\ufb01ned per degree n (total of 120 activity\ncategories) and a dataset was assembled per category, containing one activity sequence per person (9\npeople, 1080 sequences in total). Overall, the activity sequences differ in the number, category, and\ntemporal order of atomic actions. Since the attribute ground truth is available for all atomic actions\nin this dataset, it is possible to train clean attribute models. Hence, all performance variations can\nbe attributed to the quality of the attribute-based inference of different approaches.\nWe started by comparing the binary PCA representation that underlies the BDS to the PCA and\nKPCA decompositions of the LDS and KDS. In all cases we projected a set of attribute score vectors\n{\u03c0_t} into the low-dimensional PCA subspace, computed the reconstructed score vectors {\u02c6\u03c0_t}, and\nthe KL divergence KL(B(y; \u03c0_t) || B(y; \u02c6\u03c0_t)), as reported in Figure 3. The kernel used for KPCA was\nthe logit kernel K(\u03c0_1, \u03c0_2) = \u03c3^{\u22121}(\u03c0_1)^T \u03c3^{\u22121}(\u03c0_2), where \u03c3^{\u22121}(\u00b7) is the element-wise logit function.\nFigure 3 shows the average log-KL divergence, over the entire dataset, as a function of the number of\nPCA components used in the reconstruction. Binary PCA outperformed both PCA and KPCA. The\nimprovements over KPCA are particularly interesting since the latter uses the logistic transformation\nthat distinguishes binary PCA from PCA. This is explained by the Euclidean similarity measure that\nunderlies the assumption of Gaussian noise in KPCA, as discussed in Section 3.3. To gain some\nmore insight into the different models, a KDS and a BDS were learned from the 30 dimensional\nattribute score vectors of the activity sequence in Figure 2. A new set of attribute score vectors was\nthen sampled from each model. The evolution of the scores sampled for the \u201ctwo-arms-motion\u201d\nattribute is shown in the \ufb01gure (in red/blue for BDS/KDS). Note how the scores sampled from the\nBDS approximate the original attribute scores better than those sampled from the KDS, which is\ncon\ufb01rmed by the KL-divergences between the original attribute scores and those sampled from the\ntwo models (also shown in the \ufb01gure).\n\nFigure 2: Top: key frames from the activity sequence class \u201cwalk-pjump-wave1-wave2-wave2\u201d. Bottom: score of \u201ctwo-arms-motion\u201d\nattribute for video of this activity. True scores in black, and scores\nsampled from the BDS (red) and KDS (blue). Also shown is the KL-\ndivergence between sampled and original scores, for both models.\n\nFigure 3: Log KL-divergence between original and reconstructed attribute scores, vs. number of PCA components n, on Weizmann activities for\nPCA, kernel PCA, and binary PCA.\n\nTable 1: Classi\ufb01cation Accuracy on Weizmann Activities and Olympic Sports Datasets\n\nDataset | BoF | Holistic Attri. | DTM | TOT | KDS | BDS\nWeizmann Activities | 57.8% | 72.6% | 84.6% | 88.2% | 90.2% | 94.8%\nOlympic Sports | 56.8% | 63.5% | 47.1% | 53.3% | 62.3% | 65.7%\n\nWe next evaluated the bene\ufb01ts of different dynamics representations for activity recognition. Recog-\nnition rates were obtained with a 9-fold leave-one-out-cross-validation (LOOCV), where, per trial,\nthe activities of one subject were used as test set and those of the remaining 8 as training set. 
We\ncompared the performance of classifiers based on the KDS and BDS with a BoF classifier, a holistic\nattribute classifier that ignores attribute dynamics (using a single attribute score vector computed\nfrom the entire video sequence) and the dynamic topic models DTM [23] and TOT [24] from the\ntext literature. For the latter, the topics were equated to the activity attributes and learned with supervision\n(using the SVMs discussed above). Unsupervised versions of the topic models had worse\nperformance and are omitted. Classification was performed with Bayes\u2019 rule for topic models, and a\nnearest-neighbor classifier for the remaining methods. For BDS, distances were measured with (12),\nwhile for the KDS we tried the Binet-Cauchy, \u03c7\u00b2, intersection and logit kernels, and reported the\nbest results. The \u03c7\u00b2 distance was used for the BoF and holistic attribute classifiers. The classification\naccuracy of all classifiers is shown in Table 1. BDS and KDS had the best performance, followed by\nthe dynamic topic models, and the dynamics-insensitive methods (BoF and holistic). Note that the\ndifference between the holistic classifier and the best dynamic model is approximately 22%. This\nshows that while attributes are important (14.8% improvement over BoF) they are not the whole\nstory. Problems involving fine-grained activity classification, i.e., discrimination between activities\ncomposed of similar actions executed in different sequences, require modeling of attribute dynamics.\nAmong dynamic models, the BDS outperformed both the KDS and the topic models DTM and TOT.\n\n5.2 Olympic Sports\nThe second set of experiments was performed on the Olympic Sports dataset [4]. This contains\nYouTube videos of 16 sport activities, with a total of 783 sequences. 
Some activities are sequences\n\nTable 2: Fine-grained Classification Accuracy on Olympic Sports by BDS\n\nMethod | clean&jerk (snatch) | long-jump (triple-jump) | snatch (clean&jerk) | triple-jump (long-jump)\nBDS | 85% (9%) | 80% (2%) | 78% (10%) | 62% (14%)\nHolistic | 73% (21%) | 72% (20%) | 65% (27%) | 38% (43%)\n\nTable 3: Mean Average Precision on Olympic Sports Dataset\n\nLaptev et al. [3] (BoF) | Niebles et al. [4] (BDS) | Liu et al. [11] (Attr. / B+A) | B+A+D\n62.0% (67.8%) | 72.1% (73.2%) | 74.4% (72.9% / 73.3%) | 76.5%\n\nFigure 4: Scatter plot of accuracy gain on Olympic Sports by BDS.\n\nof atomic actions, whose temporal structure is critical for discrimination from other classes (e.g.,\n\u201cclean and jerk\u201d vs. \u201csnatch\u201d, and \u201clong-jump\u201d vs. \u201ctriple-jump\u201d). Since attribute labels are only\navailable for whole sequences, the training sets of the attribute classifiers are much noisier than\nin the previous experiment. This degrades the quality of attribute models. The dataset was split\ninto 5 subsets of roughly the same size, and results were obtained by 5-fold cross-validation. 
The\nDTM and TOT classifiers were as above, and all others were implemented with an SVM of kernel\nK\u03b1(i, j) = exp(\u2212(1/\u03b1) d\u00b2(i, j)), based on the distance measures d(i, j) of the previous section.\nTable 1 shows that dynamic modeling again has the best performance. However, the gains over the\nholistic attribute classifier are smaller than in Weizmann. This is due to two factors. First, the noisy\nattributes make the dynamics harder to model. Note that the robustness of the dynamic models to\nthis noise varies substantially. As before, topic models are the weakest of the dynamic models and the BDS\noutperforms the KDS. Second, since fine-grained discrimination is not needed for all categories,\nattribute dynamics are not always necessary. This is confirmed by Figure 4, which presents a scatter\nplot of the gain (difference in accuracy) of the BDS classifier over the holistic classifier, as a function\nof the accuracy of the latter. Each point corresponds to an activity. Note the strongly negative\ncorrelation between the two factors: the largest gains occur for the most difficult classes for the\nholistic classifier. Table 2 details these results for the two pairs of classes with the most confusable attributes.\nNumbers outside brackets correspond to the ground-truth category, numbers in brackets to the\nconfusing class (percentage of ground-truth examples assigned to it). The BDS performs substantially better\non these classes (e.g., 62% vs. 38% on triple-jump). Overall, despite the attribute noise and the fact that dynamics are not\nalways required for discrimination, the BDS achieves the best performance on this dataset.\nFinally, we compare the BDS classifier to classifiers from the literature. Three approaches, representative\nof the state of the art in classification with the BoF [3], dynamic representations [4],\nand attributes [11], were selected as benchmarks. 
These were compared to our implementations\nof BoF (kernel using only word histograms), attributes (the holistic classifier of Table 1), dynamics\n(the BDS classifier), and multiple kernel classifiers combining 1) BoF and attributes (B+A), and\n2) BoF, attributes, and dynamics (B+A+D). All multiple kernel combinations were determined by\ncross-validation. The mean average precisions of all 1-vs-all classifiers are reported in Table 3. The\nnumbers in each column refer to directly comparable classifiers, e.g., B+A is directly comparable\nto [11], which jointly classifies BoF histograms and holistic attribute vectors with a latent SVM.\nNote that the BDS classifier outperforms the state of the art in dynamic classifiers (Niebles et al.\n[4]), which accounts for the dynamics of the BoF but not action attributes. This holds despite the fact\nthat our attribute categories (only 40 specified attributes) and classifiers (simple SVMs) are much\nsimpler than the best in the literature [11], which uses data-driven attributes in addition to the same 40 specified\nattributes, plus a latent SVM as the classifier. The use of a stronger attribute detection architecture\ncould potentially further improve these results. Note also that the addition of the BDS kernel\nto the simple attribute representation (B+A+D) outperforms (76.5% vs. 74.4%) the more sophisticated attribute\nclassifier of [11], which does not account for attribute dynamics. This illustrates the benefits\nof modeling the dynamics of attributes. The combination of BoF, attributes, and attribute dynamics\nachieves the overall best performance on this dataset.\n\nAcknowledgements\n\nThis work was partially supported by the NSF award under Grant CCF-0830535. 
We also thank\nJingen Liu for providing the attribute annotations.\n\nReferences\n[1] J. K. Aggarwal and M. S. Ryoo, \u201cHuman activity analysis: A review,\u201d ACM Computing Surveys, vol. 43, no. 16, pp. 1\u201316, 2011.\n[2] P. Doll\u00e1r, V. Rabaud, G. Cottrell, and S. Belongie, \u201cBehavior recognition via sparse spatio-temporal features,\u201d ICCV VS-PETS, 2005.\n[3] I. Laptev, M. Marsza\u0142ek, C. Schmid, and B. Rozenfeld, \u201cLearning realistic human actions from movies,\u201d CVPR, 2008.\n[4] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, \u201cModeling temporal structure of decomposable motion segments for activity classification,\u201d ECCV, 2010.\n[5] B. Laxton, J. Lim, and D. Kriegman, \u201cLeveraging temporal, contextual and ordering constraints for recognizing complex activities in video,\u201d CVPR, 2007.\n[6] R. Chaudhry, A. Ravichandran, G. Hager, and R. 
Vidal, \u201cHistograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions,\u201d CVPR, 2009.\n[7] A. Gaidon, Z. Harchaoui, and C. Schmid, \u201cActom sequence models for efficient action detection,\u201d CVPR, 2011.\n[8] B. Li, M. Ayazoglu, T. Mao, O. Camps, and M. Sznaier, \u201cActivity recognition using dynamic subspace angles,\u201d CVPR, 2011.\n[9] C. H. Lampert, H. Nickisch, and S. Harmeling, \u201cLearning to detect unseen object classes by between-class attribute transfer,\u201d CVPR, 2009.\n[10] N. Rasiwasia and N. Vasconcelos, \u201cHolistic context models for visual recognition,\u201d IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 902\u2013917, 2012.\n[11] J. Liu, B. Kuipers, and S. Savarese, \u201cRecognizing human actions by attributes,\u201d CVPR, 2011.\n[12] A. Fathi and G. Mori, \u201cAction recognition by learning mid-level motion features,\u201d CVPR, 2008.\n[13] N. Rasiwasia and N. Vasconcelos, \u201cHolistic context modeling using semantic co-occurrences,\u201d CVPR, 2009.\n[14] A. I. Schein, L. K. Saul, and L. H. Ungar, \u201cA generalized linear model for principal component analysis of binary data,\u201d AISTATS, 2003.\n[15] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, \u201cDynamic textures,\u201d Int\u2019l J. Computer Vision, vol. 51, no. 2, pp. 91\u2013109, 2003.\n[16] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal, \u201cBinet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes,\u201d Int\u2019l J. Computer Vision, vol. 73, no. 1, pp. 95\u2013119, 2006.\n[17] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, \u201cActions as space-time shapes,\u201d IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247\u20132253, 2007.\n[18] N. \u0130kizler and D. A. 
Forsyth, \u201cSearching for complex human activities with no visual examples,\u201d Int\u2019l J. Computer Vision, vol. 80, no. 3, pp. 337\u2013357, 2008.\n[19] V. Kellokumpu, G. Zhao, and M. Pietik\u00e4inen, \u201cHuman activity recognition using a dynamic texture based method,\u201d BMVC, 2008.\n[20] N. Rasiwasia and N. Vasconcelos, \u201cScene classification with low-dimensional semantic spaces and weak supervision,\u201d CVPR, 2008.\n[21] A. Quattoni, M. Collins, and T. Darrell, \u201cLearning visual representations using images with captions,\u201d CVPR, 2007.\n[22] N. Rasiwasia, P. J. Moreno, and N. Vasconcelos, \u201cBridging the gap: Query by semantic example,\u201d IEEE Trans. Multimedia, vol. 9, no. 5, pp. 923\u2013938, 2007.\n[23] D. M. Blei and J. D. Lafferty, \u201cDynamic topic models,\u201d ICML, 2006.\n[24] X. Wang and A. McCallum, \u201cTopics over time: a non-Markov continuous-time model of topical trends,\u201d ACM SIGKDD, 2006.\n[25] M. Collins, S. Dasgupta, and R. E. Schapire, \u201cA generalization of principal component analysis to the exponential family,\u201d NIPS, 2002.\n[26] R. H. Shumway and D. S. Stoffer, \u201cAn approach to time series smoothing and forecasting using the EM algorithm,\u201d Journal of Time Series Analysis, vol. 3, no. 4, pp. 253\u2013264, 1982.\n[27] B. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller, \u201cNonlinear component analysis as a kernel eigenvalue problem,\u201d Neural Computation, vol. 10, pp. 1299\u20131319, 1998.\n[28] A. B. Chan and N. Vasconcelos, \u201cClassifying video with kernel dynamic textures,\u201d CVPR, 2007.\n[29] C.-C. Chang and C.-J. Lin, \u201cLIBSVM: A library for support vector machines,\u201d ACM Trans. on Intelligent Systems and Technology, vol. 2, no. 3, pp. 
27:1\u201327:27, 2011.\n", "award": [], "sourceid": 535, "authors": [{"given_name": "Weixin", "family_name": "Li", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}