{"title": "Subset Selection and Summarization in Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1035, "page_last": 1045, "abstract": "Subset selection, which is the task of finding a small subset of representative items from a large ground set, finds numerous applications in different areas. Sequential data, including time-series and ordered data, contain important structural relationships among items, imposed by underlying dynamic models of data, that should play a vital role in the selection of representatives. However, nearly all existing subset selection techniques ignore underlying dynamics of data and treat items independently, leading to incompatible sets of representatives. In this paper, we develop a new framework for sequential subset selection that finds a set of representatives compatible with the dynamic models of data. To do so, we equip items with transition dynamic models and pose the problem as an integer binary optimization over assignments of sequential items to representatives, that leads to high encoding, diversity and transition potentials. Our formulation generalizes the well-known facility location objective to deal with sequential data, incorporating transition dynamics among facilities. As the proposed formulation is non-convex, we derive a max-sum message passing algorithm to solve the problem efficiently. Experiments on synthetic and real data, including instructional video summarization, show that our sequential subset selection framework not only achieves better encoding and diversity than the state of the art, but also successfully incorporates dynamics of data, leading to compatible representatives.", "full_text": "Subset Selection and Summarization\n\nin Sequential Data\n\nEhsan Elhamifar\n\nM. Clara De Paolis Kaluza\n\nComputer and Information Science College\n\nComputer and Information Science College\n\nNortheastern University\n\nBoston, MA 02115\n\neelhami@ccs.neu.edu\n\nNortheastern University\n\nBoston, MA 02115\n\nclara@ccs.neu.edu\n\nAbstract\n\nSubset selection, which is the task of \ufb01nding a small subset of representative items\nfrom a large ground set, \ufb01nds numerous applications in different areas. Sequential\ndata, including time-series and ordered data, contain important structural relation-\nships among items, imposed by underlying dynamic models of data, that should\nplay a vital role in the selection of representatives. However, nearly all existing\nsubset selection techniques ignore underlying dynamics of data and treat items\nindependently, leading to incompatible sets of representatives. In this paper, we\ndevelop a new framework for sequential subset selection that \ufb01nds a set of represen-\ntatives compatible with the dynamic models of data. To do so, we equip items with\ntransition dynamic models and pose the problem as an integer binary optimization\nover assignments of sequential items to representatives, that leads to high encoding,\ndiversity and transition potentials. Our formulation generalizes the well-known\nfacility location objective to deal with sequential data, incorporating transition\ndynamics among facilities. As the proposed formulation is non-convex, we derive a\nmax-sum message passing algorithm to solve the problem ef\ufb01ciently. Experiments\non synthetic and real data, including instructional video summarization, show that\nour sequential subset selection framework not only achieves better encoding and\ndiversity than the state of the art, but also successfully incorporates dynamics of\ndata, leading to compatible representatives.\n\n1\n\nIntroduction\n\nSubset selection is the task of \ufb01nding a small subset of most informative items from a ground set.\nBesides helping to reduce the computational time and memory of algorithms, due to working on a\nmuch smaller representative set [1], it has found numerous applications, including, image and video\nsummarization [2, 3, 4], speech and document summarization [5, 6, 7], clustering [8, 9, 10, 11, 12],\nfeature and model selection [13, 14, 15, 16], sensor placement [17, 18], social network marketing [19]\nand product recommendation [20]. Compared to dictionary learning methods such as Kmeans [21],\nKSVD [22] and HMMs [23], that learn centers/atoms in the input-space, subset selection methods\nchoose centers/atoms from the given set of items.\nSequential data, including time-series such as video, speech, audio and sensor measurements as well\nas ordered data such as text, form an important large part of modern datasets, requiring effective subset\nselection techniques. Such datasets contain important structural relationships among items, often\nimposed by underlying dynamic models, that should play a vital role in the selection of representatives.\nFor example, there exists a logical way in which segments of a video or sentences of a document are\nconnected together and treating segments/sentences as a bag of randomly permutable items results\nin losing the semantic content of the video/document. However, existing subset selection methods\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fx1 x2\n\nxM\n\nd1,1\n\ny1 y2 y3\n\ndM,T\n\nyT\n\nFigure 1: We propose a framework, based on a generalization of the facility location problem, for the\nsummarization of sequential data. Given a source set of items {x1, . . . , xM} with a dynamic transition model\nand a target set of sequential items (y1, . . . , yT ), we propose a framework to \ufb01nd a sequence of representatives\nfrom the source set that has a high global transition probability and well encodes the target set.\n\nignore these relationships and treat items independent from each other. Thus, there is a need for\nsequential subset selection methods that, instead of treating items independently, use the underlying\ndynamic models of data to select high-quality, diverse and compatible representatives.\nPrior Work: A subset selection framework consists of three main components: i) the inputs to the\nalgorithm; ii) the objective function to optimize, characterizing the informativeness and diversity of\nselected items; iii) the algorithm to optimize the objective function. The inputs to subset selection\nalgorithms are in the form of either feature vector representations or pairwise similarities between\nitems. Several subset selection criteria have been studied in the literature, including maximum cut\nobjective [24, 25], maximum marginal relevance [26], capacitated and uncapacitated facility location\nobjectives [27, 28], multi-linear coding [29, 30] and maximum volume subset [6, 31], which all try to\ncharacterize the informativeness/value of a subset of items in terms of ability to represent the entire\ndistribution and/or having minimum information overlap among selected items. On the other hand,\noptimizing almost all subset selection criteria is, in general, NP-hard and non-convex [25, 32, 33, 34],\nwhich has motivated the development and study of approximate methods for optimizing these criteria.\nThis includes greedy approximate algorithms [28] for maximizing submodular functions, such as\ngraph-cuts and facility location, which have worst-case approximation guarantees, as well as sampling\nmethods from Determinantal Point Process (DPP) [6, 31], a probability measure on the set of all\nsubsets of a ground set, for approximately \ufb01nding the maximum volume subset. Motivated by the\nmaturity of convex optimization and advances in sparse and low-rank recovery, recent methods have\nfocused on convex relaxation-based methods for subset selection [8, 9, 2, 35, 36].\nWhen it comes to sequential data, however, the majority of subset selection methods ignore the\nunderlying dynamics and relationships among items and treat items independent from each other.\nRecent results in [37, 3] have developed interesting extensions to DPP-based subset selection, by\ncapturing representatives in a sequential order such that newly selected representatives are diverse\nwith respect to the previously selected ones. However, sequential diversity by itself is generally\ninsuf\ufb01cient, especially, when the sequence of diverse selected items are unlikely to follow each other\naccording to underlying dynamic models. For example, in a video/document on a speci\ufb01c topic\nwith intermediate irrelevant scenes/sentences to the topic, promoting sequential diversity results\nin selecting irrelevant scenes/sentences.\n[38] extends submodular functions to capture ordered\npreferences among items, where ordered preferences are represented by a directed acyclic graph over\nitems, and presents a greedy algorithm to pick edges instead of items. The method, however, cannot\ndeal with arbitrary graphs, such as Markov chains with cycles. On the other hand, while Hidden\nMarkov Models (HMMs) [23, 39] and dynamical systems [40, 41] have been extensively studied for\nmodeling sequential data, they have not been properly exploited in the context of subset selection.\n\nPaper Contributions: In this paper we develop a new framework for sequential subset selection that\nincorporates the dynamic model of sequential data into subset selection. We develop a new class\nof objective functions that promotes to select not only high-quality and diverse items, but also a\nsequence of representatives that are compatible with the dynamic model of data. To do so, we propose\na dynamic subset selection framework, where we equip items with transition probabilities and design\nobjective functions to select representatives that well capture the data distribution with a high overall\ntransition probability in the sequence of representatives, see Figure 1. Our formulation generalizes\nthe facility location objective [27, 28] to sequential data, by incorporating transition dynamics among\nfacilities. Since our proposed integer binary optimization is non-convex, we develop a max-sum\n\n2\n\n\fmessage passing framework to solve the problem ef\ufb01ciently. By experiments on synthetic and real\ndata, including instructional video summarization, we show that our method outperforms the state of\nthe art in terms of selecting representatives with better encoding, diversity and dynamic compatibility.\n\n2 Subset Selection for Sequential Data\n\nSequential data, including time-series and ordered data contain important structural relationships\namong items, often imposed by underlying dynamic models of data, that should play a vital role in\nthe selection of representatives. In this section, we develop a new framework for sequential subset\nselection that incorporates underlying dynamic models and relationships among items into subset\nselection. More speci\ufb01cally, we propose a dynamic subset selection framework, where we equip\nitems with transition probabilities and design objectives to select representatives that capture the data\ndistribution with a high transition probability in the sequence of representatives. In the next section,\nwe develop an ef\ufb01cient algorithm to solve the proposed optimization problem.\n\n2.1 Sequential Subset Selection Formulation\nAssume we have a source set of items X = {x1, . . . , xM}, equipped with a transition model,\np(xi0|xi1, . . . , xin), between items, and a target set of sequential items Y = (y1, . . . , yT ). Our\ngoal is to \ufb01nd a small representative subset of X that well encode Y, while the set of representatives\nare compatible according to the dynamic model of X. Let xrt be the representative of yt for\nt 2 {1, . . . , T}. We propose a potential function (r1, . . . , rT ) whose maximization over all\npossible assignments (r1, . . . , rT ) \u2713 {1, . . . , M}T , i.e.,\n(1)\n\n (r1, . . . , rT )\n\nmax\n\n(r1,...,rT )\u2713{1,...,M}T\n\nachieves the three goals of i) minimizing the encoding cost of Y via the representative set; ii) selecting\na small set of representatives from X; iii) selecting an ordered set of representatives (xr1, . . . , xrT )\nthat are compatible with the dynamics on X.\nTo tackle the problem, we consider a decomposition of the potential function into the product of\nthree potentials, corresponding to the three aforementioned objectives, as\n\n (r1, . . . , rT ) , enc(r1, . . . , rT ) \u21e5 card(r1, . . . , rT ) \u21e5 dyn(r1, . . . , rT ),\n\n(2)\nwhere enc(r1, . . . , rT ) denotes the encoding potential that favors selecting a representative set from\nX that well encodes Y, card(r1, . . . , rT ) denotes the cardinality potential that favors selecting a\nsmall number of distinct representatives. Finally, dyn(r1, . . . , rT ) denotes the dynamic potential\nthat favors selecting an ordered set of representatives that are likely to be generated by the underlying\ndynamic model on X. Next, we study each of the three potentials.\n\nEncoding Potential: Since the encoding of each item of Y depends on its own representative, we\nassume that the encoding potential function factorizes as\n\nenc(r1, . . . , rT ) =\n\nenc,t(rt),\n\n(3)\n\nwhere enc,t(i) characterizes how well xi encodes yt and becomes larger when xi better represents\nyt. In this paper, we assume that enc,t(i) = exp(di,t), where di,t indicates the dissimilarity of xi\nto yt.1 A lower dissimilarity di,t means that xi better encodes/represents yt.\nCardinality Potential: Notice that maximizing the encoding potential alone results in selecting\nmany representatives. Hence, we consider a cardinality potential to restrict the total number of\nrepresentatives. Denoting the number of representatives by |{r1, . . . , rT}|, we consider\n\ncard(r1, . . . , rT ) = exp(|{r1, . . . , rT}|),\n\n(4)\nwhich promotes to select a small number of representatives. The parameter  > 0 controls the effect\nof the cardinality on the global potential , where a close to zero  ignores the effect of cardinality\npotential, resulting in many representatives, and a larger  results in a smaller representative set.\n1We can also use similarities si,t instead of dissimilarities, in which case we set enc,t(i) = exp(si,t).\n\nTYt=1\n\n3\n\n\fDynamic Potential: While encoding and cardinality potentials together promote selecting a few\nrepresentatives from X that well encode Y, there is no guarantee that the sequence of representatives\n(xr1, . . . , xrT ) is compatible with the underlying dynamic of X. Thus, we introduce a dynamic\npotential that measures the compatibility of the sequence of representatives. To do so, we consider\nan n-th order Markov Model to represent the dynamic relationships among the items in X, where\nthe selection of the representative xrt depends on the m previously selected representatives, i.e.,\nxrt1, . . . , xrtn. More precisely, we consider\n\npt(xrt|xrt1, . . . , xr1)\u21e5\n\ndyn(r1, . . . , rT ) = p1(xr1)\u21e5\npt(xrt|xrt1, . . . , xrtn)!\nwhere pt(xi) indicates the probability of selecting xi as the representative of yt and\npt(xi0|xi1, . . . , xin) denotes the probability of selecting xi0 as the representative of yt given that\nxi1, . . . , xin has been selected as the representative of yt1, . . . , ytn, respectively. The regulariza-\ntion parameter  > 0 determines the effect of the dynamic potential on the overall potential , where\na close to zero  results in discounting the effect of the dynamic of X. As a result, maximizing the\ndynamic potential promotes to select a sequence of representatives that are highly likely to follow the\ndynamic model on the source set. In this paper, we assume that the transition dynamic model on the\nsource set is given and known. In the experiments on video summarization, we learn the dynamic\nmodel by \ufb01tting a hidden Markov Model to data.\n\nTYt=n+1\n\nnYt=2\n\n, (5)\n\nTYt=2\n\n2.2 Optimization Framework for Sequential Subset Selection\nIn the rest of the paper, we consider a \ufb01rst order Markov model, which performs well in the application\nstudied in the paper (our proposed optimization can be generalized to n-th order Markov models as\nwell). Putting all three potentials together, we consider maximization of the global potential function\n\n.\n\n(6)\n\n =\n\nTYt=1\n\nenc,t(rt) \u21e5 card(r1, . . . , rT ) \u21e5 p1(xr1) \u21e5\n\npt(xrt|xrt1)!\nover all possible assignments (r1, . . . , rT ) \u2713 {1, . . . , M}T . To do so, we cast the problem as an\ninteger binary optimization. We de\ufb01ne binary assignment variables {zi,t}t=1,...,T\ni=1,...,M, where zi,t 2\n{0, 1} indicates if xi is a representative of yt. Since each item yt is associated with only a single\nrepresentative, we havePM\ni,i0=1,...,M,\ni=1 zi,t = 1. Also, we de\ufb01ne variables {i}i=1,...,M and {ut\ni0,i 2 {0, 1} indicates if xi0 is a\nwhere i 2 {0, 1} indicates if xi is a representative of y1and ut\nrepresentative of yt given that xi is a representative of yt1. As we will show, {i} and {ut\ni0,i} are\nrelated to {zi,t}, hence, the \ufb01nal optimization only depends on {zi,t}.\nUsing the variables de\ufb01ned above, we can rewrite the global potential function in (6) as\nTYt=2\nTXt=2\nTXt=1\ni0,i} can be written as functions of the\nwhere we used log enc,t(i) = di,t. Notice that {i} and {ut\nassignment variables {zi,t}. Denoting the indicator function by 1(\u00b7), which is one when its argument\ni0,i = 1(rt=i0,rt1=i). Hence, we have\nis true and is zero otherwise, we can write i = 1(r1=i) and ut\n(9)\n\nWe can equivalently maximize the logarithm of , which is to maximize\n\nMYi0=1\nMYi=1\nMXi,i0=1\n\nenc,t(i)zi,t \u21e5 card(r1, . . . , rT ) \u21e5\n\nzi,tdi,t + log card(r1, . . . , rT ) +\n\nTYt=1\nMXi=1\n\nui0,i log pt(xi0|xi),\n\np1(xi)i \u21e5\n\npt(xi0|xi)ut\n\ni0,i.\n\n(7)\n\ni0,i}t=1,...,T\n\ni log p1(xi) +\n\n =\n\nMYi=1\n\nMYi=1\n\nMXi=1\n\nAs a result, we can rewrite the maximization in (8) as the equivalent optimization\n\ni = zi,1, ut\n\ni0,i = zi,t1zi0,t.\n\n(8)\n\nmax\n{zi,t}\n\nMXi=1\nTXt=1\nTXt=2\nMXi,i0=1\n\n+\n\nzi,tdi,t + log card(r1, . . . , rT ) + (\n\nzi,1 log p1(xi)\n\nMXi=1\n\nzi,t1zi0,t log pt(xi0|xi))\n\ns. t. zi,t 2 {0, 1},\n\n4\n\n(10)\n\nMXi=1\n\nzi,t = 1, 8 i, t.\n\n\f\u2713C\n1\n\n\u271311\n\nz11\n\n\u271321\n\nz21\n\n\u271331\n\nz31\n...\n\n\u2713D\n11;12\n\n...\n\n\u2713D\n21;12\n\n...\n\n\u2713D\n31;12\n\n...\n\n\u2713C\n2\n\n\u271312\n\nz12\n\n\u271322\n\nz22\n\n\u271332\n\nz32\n...\n\n\u2713D\n12;13\n\n...\n\n\u2713D\n22;13\n\n...\n\n\u2713D\n32,13\n\n\u2713C\n3\n\n\u2713R\n1\n\n\u2713R\n2\n\n\u2713R\n3\n\n\u271313\n\nz13\n\n\u271323\n\nz23\n\n\u271333\n\nz33\n...\n\n\u00b7\u00b7\u00b7\n\n\u00b7\u00b7\u00b7\n\n\u00b7\u00b7\u00b7\n\n...\n\n\u2713it\n\n\u2713C\nt\n\n\u2318it\n0it,1(t+1)\n\nit\n\nzit\n\n1 )\n\n0it,i0(t +\n\n0it,M (t+ 1)\n\n...\n\n...\n\n\u2713R\ni\n\n\u21b5it\n\nit,1(t+1)\n\n\n\nit,i0(t+1)\n\n\nit,\n\nM\n\n(\nt\n\n+\n\n1\n\n)\n\n...\n\n...\n\n\u2713D\nit;1(t+1)\n\n\u2713D\nit;i0(t+1)\n\n\u2713D\nit;M(t+1)\n\n\u2713D\n1(t1);it\n\n\u2713D\ni0(t1);it\n\n\u2713D\nM(t1);it\n\nFigure 2: Left: Factor graph representing (12). Right: Messages from each factor to a variable node zi,t.\n\nIt is important to note that if xi becomes a representative of some items in Y, then k [zi,1 \u00b7\u00b7\u00b7 zi,T ]k1\nwould be 1. Hence, the number of representatives is given byPM\ni=1 k [zi,1 \u00b7\u00b7\u00b7 zi,T ]k1. As a result,\nwe can rewrite the cardinality potential in (4) as\n\ncard(r1, . . . , rT ) = exp(\n\nk [zi,1 \u00b7\u00b7\u00b7 zi,T ]k1).\n\n(11)\n\nMXi=1\n\nFinally, considering a homogeneous Markov Model on the dynamics of the source set, where\npt(\u00b7|\u00b7) = p(\u00b7|\u00b7), i.e., transitioning from xi as the representative of yt1 to xi0 as the representative of\nyt does not depend on t, we propose to solve the optimization\n\nmax\n{zi,t}\n\nzi,tdi,t  \n\nMXi=1\n\nk [zi,1 \u00b7\u00b7\u00b7 zi,T ]k1 + (\n\nMXi=1\n\nzi,1 log p1(xi)\n\nzi,t1zi0,t log p(xi0|xi))\n\ns. t. zi,t 2 {0, 1},\n\nTXt=1\nMXi=1\nMXi,i0=1\nTXt=2\n\n+\n\n(12)\n\nMXi=1\n\nzi,t = 1, 8 i, t.\n\nIn our proposed formulation above, we assume that the dissimilarities {di,t} and the dynamic models,\ni.e., the probabilities p1(\u00b7) and p(\u00b7|\u00b7), are known. These models can be given by prior knowledge\nor by learning from training data, as we show in the experiments. It is important to notice that the\noptimization in (12) is non-convex, due to binary optimization variables and quadratic terms in the\nobjective function, which is not necessarily positive semi-de\ufb01nite (this can be easily seen when\np(xi0|xi) 6= p(xi|xi0) for some i, i0). In the next section, we treat (12) as a MAP inference on binary\nrandom variables and develop a message passing algorithm to \ufb01nd the hidden values {zi,t}.\nOnce we solve the optimization in (12), we can obtain the representatives as the items of X for which\nzi,t is non-zero for some t. Moreover, we can obtain the segmentation of the sequential items in Y\naccording to their assignments to the representatives. In fact, the sequence of representatives obtained\nby our proposed optimization in (12) not only corresponds to diverse items that well encode the\nsequential target data, but also is compatible with the underlying dynamic of the source data.\n\nRemark 1 Without the dynamic potential, i.e., with  = 0, our proposed optimization in (12) reduces\nto the uncapacitated facility location objective. Hence, our framework generalizes the facility location\nto sequential data by considering transition dynamics among facilities (source set items). On the\nother hand, if we assume uniform distributions for the initial and transition probabilities, the dynamic\nterm (last term) in our objective function becomes a constant, hence, our formulation reduces to the\nuncapacitated facility location. As a result, our framework generalizes the facility location, where we\nconsider arbitrary initial and transition probabilities on X instead of a uniform distribution.\n\n3 Message Passing for Sequential Subset Selection\nIn this section, we develop an ef\ufb01cient message passing algorithm to solve the proposed optimization\nin (12). To do so, we treat the sequential subset selection as a MAP inference, where {zi,t} correspond\nto binary random variables whose joint log-likelihood is given by the objective function in (12). We\nrepresent the log-likelihood, i.e., the objective function in (12), with a factor graph [42], which is\nshown in Figure 2. Recall that a factor graph is a bipartite graph that consists of variable nodes and\n\n5\n\n\ffactor nodes, where every factor evaluates a potential function over variables it is connected to. The\nlog-likelihood is then proportional to the sum of all factor potentials.\nTo form the factors corresponding to the objective function in (12), we de\ufb01ne mi,i0 , log p(xi0|xi)\nand \u00afdi,t , di,t  log p1(xi) if t = 1 and \u00afdi,t , di,t for all other values of t. Denoting zi,: ,\n\u00b7\u00b7\u00b7 zM,t]>, we de\ufb01ne factor potentials corresponding to our\n[zi,1\nframework, shown in Figure 2. More speci\ufb01cally, we de\ufb01ne the encoding and dynamic potentials,\nrespectively, as \u2713i,t(zi,t) ,  \u00afdi,tzi,t and \u2713D\ni,t1;i0,t(zi,t1, zi0,t) , mi,i0zi,t1zi0,t. Moreover we\nde\ufb01ne the cardinality and constraint potentials, respectively, as\n\n\u00b7\u00b7\u00b7 zi,T ]> and z:,t , [z1,t\n\nkzi,:k1 > 0\notherwise\n\n, \u2713C\n\nt (z:,t) ,(0,\n\n1,\n\nPM\n\ni=1 zi,t = 1\n\n.\n\notherwise\n\nThe MAP formulation of our sequential subset selection is then given by\n\n0,\n\n\u2713R\n\ni (zi,:) ,\u21e2,\nMXi=1\nMXi=1\n\n\u2713i,t(zi,t) +\n\nmax\n{zi,t}\n\nTXt=1\n\n\u2713R\ni (zi,:) +\n\n\u2713C\nl (z:,t) + \n\nTXt=1\n\nT1Xt=1\n\nMXi0=1\n\nMXi=1\n\ni,t1;i0,t(zi,t1, zi0,t). (13)\n\u2713D\n\nTo perform MAP inference, we use the max-sum message passing algorithm, which iteratively\nupdates messages between variable and factor nodes in the graph. In our framework, the incoming\nmessages to each variable node zi,t are illustrated in Figure 2. Messages are computed as follows\n(please see the supplementary materials for the derivations).\n\ni,t  \u00afdi,t\ni,t;j,t+1 max{0, mi,j + \u21e2}  max{0, \u21e2}\n0i,t1;j,t max{0, mi,j + \u21e20}  max{0, \u21e20}\n\u2318i,t  max\ni0,t;j,t+1 +\n\nk6=i0,i{\u21b5i0,1  \u00afdi0,1 +\n\nMXj=1\n0j,t1;i0,t}\nMXj=1\nmax{0, \u00afdi,k + \u2318i,k +\nwhere, for brevity of notation, we have de\ufb01ned \u21e2 and \u21e20 as\n\n\u21b5i,t min{0, +Xk6=t\n\nMXj=1\n\n(i,k;j,k+1 + 0j,k1;i,k)}}\n\nMXk=1\n\u21e2 4=  \u00afdj,t+1 + \u21b5j,t+1 + \u2318j,t+1 +\n\u21e20 4=  \u00afdi,t1 + \u21b5i,t1 + \u2318i,t1 +Xk6=i\n\nj,t+1;k,t+2 +Xk6=i\nMXk=1\n\ni,t1;k,t +\n\n0k,t;j,t+1,\n\n0k,t2;i,t1.\n\n(14)\n(15)\n(16)\n\n(17)\n\n(18)\n\n(19)\n\n(20)\n\nThe update of messages continues until convergence, when each variable zi,t is assigned to the value\nthat maximizes the sum of its incoming messages. It is important to note that the max-sum algorithm\nalways converges to the optimal MAP assignment on trees, and has shown good performance on\ngraphs with cycles in many applications, including our work. We also use a dampening factor  2\n[0, 1) on message updates as so that a message \u00b5 is computed as \u00b5(new) \u00b5(old) + (1  )\u00b5(update).\n4 Experiments\nIn this section, we evaluate the performance of our proposed method as well as the state of the art\nfor subset selection on synthetic and real sequential data. For real applications, we consider the task\nof summarizing instructional videos to learn the key steps of the task described in the videos. In\naddition to our proposed message passing (MP) algorithm, we have implemented the optimization\nin (12) using an ADMM framework [43], where we have relaxed the integer binary constraints to\nzi,t 2 [0, 1]. In practice both MP and ADMM algorithms achieve similar results. Hence, we report\nthe performance of our method using the MP algorithm.\nWe compare our proposed method, Sequential Facility Location (SeqFL), with several subset selection\nalgorithms. Since we study the performance of methods as a function of the size of the representative\n\n6\n\n\f50\n40\n30\n20\n10\n0\n\n0\n\n50\n40\n30\n20\n10\n0\n\n0\n\nt\ns\no\nc\n \n\ng\nn\ni\nd\no\nc\nn\nE\n\nt\ns\no\nc\n \nc\ni\nm\na\nn\ny\nD\n*\n\n \n\n \n\n+\ng\nn\ni\nd\no\nc\nn\nE\n\nkDPP\nM-kDPP\nSeq-kDPP\nGreedy\nDS3\nSeqFL\n\n5\n\n15\nNumber of representatives\n\n10\n\n20\n\nkDPP\nM-kDPP\nSeq-kDPP\nGreedy\nDS3\nSeqFL\n\n5\n\n15\nNumber of representatives\n\n10\n\n20\n\nt\ns\no\nc\n \nc\ni\nm\na\nn\ny\nD\n\n300\n\n250\n\n200\n\n150\n\n100\n\n0\n\ne\nr\no\nc\ns\n \ny\nt\ni\ns\nr\ne\nv\ni\nD\n\n1\n0.8\n0.6\n0.4\n0.2\n0\n\n0\n\nkDPP\nM-kDPP\nSeq-kDPP\nGreedy\nDS3\nSeqFL\n\n5\n\n15\nNumber of representatives\n\n10\n\n20\n\nkDPP\nM-kDPP\nSeq-kDPP\nGreedy\nDS3\nSeqFL\n\n5\n\n15\nNumber of representatives\n\n10\n\n20\n\nFigure 3: Encoding cost, dynamic cost, total cost and diversity score of different algorithms as a function of the\nnumber of selected representatives. The size of the source set is M = 50.\n\nset, we use the \ufb01xed-size variant of DPP, called kDPP [44]. In addition to kDPP, we evaluate the\nperformance of Markov kDPP (M-kDPP) [37], in which successive representatives are diverse among\nthemselves and with respect to the previously selected representatives, as well as Sequential kDPP\n(Seq-kDPP) [3], which divides a time-series into multiple windows and successively selects diverse\nsamples from each window conditioned on the previous window.2 We also compare our method\nagainst DS3 [8] and the standard greedy method [28], which optimize the facility location objective\nfunction, which has no dynamic cost, via convex relaxation and greedy selection, respectively.\nTo compare the performance of different methods, we evaluate several costs and scores that\ndemonstrate the effectiveness of each method in terms of encoding, diversity and dynamic com-\npatibility of the set of selected representatives. More speci\ufb01cally, given dissimilarities {di,t},\nthe dynamic model p1(\u00b7) and p(\u00b7|\u00b7), representative set \u21e4, and the assignment of points to rep-\nresentatives {z\u21e4i,t}, we compute the encoding cost as PT\ni=1 di,tz\u21e4i,t, the dynamic cost as\nPM\ni,i0=1 log p(xi0|xi)z\u21e4i,t1z\u21e4i0,t and the total cost as the sum of the\nencoding cost and the dynamic cost multiplied by . We also compute the diversity score as det(K\u21e4),\nwhere K corresponds to the kernel matrix, used in DPP and its variants, and K\u21e4 denotes the subma-\ntrix of K indexed by \u21e4. We use Euclidean distances as dissimilarities and compute the corresponding\ninner-product kernel to run DPPs. Notice that the diversity score, which is the volume of the paral-\nlelotope spanned by the representatives, is what DPP methods aim to (approximately) maximize. As\nDPP methods only \ufb01nd representatives and not assignment of points, we compute z\u21e4i,t\u2019s by assigning\neach point to the closest representative in \u21e4, according to the kernel.\n\ni=1 log p1(xi)z\u21e4i,1 PT\n\nt=2PM\n\nt=1PM\n\n4.1 Synthetic Data\nTo demonstrate the effectiveness of our proposed method for sequential subset selection, we generate\nsynthetic data where for a source set X with M items corresponding to means of M Gaussians, we\ngenerate a transition probability matrix among items and an initial probability vector. We draw a\nsequence of length T from the corresponding Markov model to form the target set Y and run different\nalgorithms to generate k representatives. We then compute the average encoding and transition\ncosts as well as the diversity scores for sequences drawn from the Markov model, as a function of\nk 2 {1, 2, . . . , M}. In the experiments we set M = 50, T = 100. For a \ufb01xed , we run SeqFL for\ndifferent values of  to select different number of representatives.\nFigure 3 illustrates the encoding, dynamic and total costs as well as the diversity scores of different\nmethods, where for SeqFL we have set  = 0.02. Notice that our proposed method consistently\nobtains lower encoding, dynamic and total costs for all numbers of representatives, demonstrating its\neffectiveness for obtaining a sequence of informative representatives that are compatible according\n\n2To have a fair comparison and to select a \ufb01xed number of representatives, we modify the SeqDPP method\n\n[3] and implement Seq-kDPP where k representatives are chosen in each window.\n\n7\n\n\fs\ne\nv\ni\nt\na\nt\nn\ne\ns\ne\nr\np\ne\nr\n \nr\ne\nb\nm\nu\nN\n\n25\n20\n15\n10\n5\n0\n0\n\nt\ns\no\nc\n \nc\ni\nm\na\nn\ny\nD\n\n300\n250\n200\n150\n100\n0\n\n=0\n=0.01\n=0.02\n=0.06\n=0.08\n\n2\n\n4\n\n6\n\n8\n\n10\n\n12\n\n=0\n=0.01\n=0.02\n=0.06\n=0.08\n\n2\n\n4\n\n6\n\n8\n\n10\n\n12\n\nt\ns\no\nc\n \n\ng\nn\ni\nd\no\nc\nn\nE\n\n20\n15\n10\n5\n0\n0\n1\n\ne\nr\no\nc\ns\n \n\ny\nt\ni\ns\nr\ne\nv\ni\nD\n\n0.5\n\n0\n0\n\n=0\n=0.01\n=0.02\n=0.06\n=0.08\n12\n\n=0\n=0.01\n=0.02\n=0.06\n=0.08\n12\n\n2\n\n4\n\n6\n\n8\n\n10\n\n2\n\n4\n\n6\n\n8\n\n10\n\nFigure 4: Number of representatives, encoding cost, dynamic cost and diversity score of our proposed method\n(SeqFL) as a function of the parameters (, ).\n\nTask\nChange\n\ntire\nMake\ncoffee\n\nCPR\n\nJump\ncar\nRepot\nplant\n\nAll tasks\n\n(P, R)\nF-score\n(P, R)\nF-score\n(P, R)\nF-score\n(P, R)\nF-score\n(P, R)\nF-score\n(P, R)\nF-score\n\nkDPP\n\n(0.56, 0.50)\n\n0.53\n\n(0.38, 0.33)\n\n0.35\n\n(0.71, 0.71)\n\n0.71\n\n(0.50, 0.50)\n\n0.50\n\n(0.57, 0.67)\n\n0.62\n\n(0.54, 0.54)\n\n0.54\n\nM-kDPP\n(0.55, 0.60)\n\n0.57\n\n(0.50, 0.44)\n\n0.47\n\n(0.71, 0.71)\n\n0.71\n\n(0.56, 0.50)\n\n0.53\n\n(0.60, 0.50)\n\n0.55\n\n(0.58, 0.55)\n\n0.57\n\nSeq-kDPP\n(0.44, 0.40)\n\n0.42\n\n(0.63, 0.56)\n\n0.59\n\n(0.71, 0.71)\n\n0.71\n\n(0.56, 0.50)\n\n0.53\n\n(0.57, 0.67)\n\n0.62\n\n(0.58, 0.57)\n\n0.57\n\nDS3\n\n(0.56, 0.50)\n\n0.53\n\n(0.50, 0.56)\n\n0.53\n\n(0.71, 0.71)\n\n0.71\n\n(0.50, 0.50)\n\n0.50\n\n(0.57, 0.67)\n\n0.62\n\n(0.57, 0.59)\n\n0.58\n\nSeqFL\n\n(0.60, 0.60)\n\n0.60\n\n(0.50, 0.56)\n\n0.53\n\n(0.83, 0.71)\n\n0.77\n\n(0.60, 0.60)\n\n0.60\n\n(0.80, 0.67)\n\n0.73\n\n(0.67, 0.63)\n\n0.65\n\nTable 1: Precision (P), Recall (R) and F-score for the summarization of instructional videos for \ufb01ve tasks.\n\nto the underlying dynamics. It is important to notice that although our method does not maximize\nthe diversity score, used and optimized in kDPP and its variants, it achieves slightly better diversity\nscores (higher is better) than kDPP and M-kDPP. Figure 4 demonstrates the effect of the parameters\n(, ) on the solution of our proposed method. Notice that for a \ufb01xed , as  increases, we select\na smaller number of representatives, hence, the encoding cost increases. Also, for a \ufb01xed , as\n increases, we put more more emphasis on dynamic compatibility of representatives, hence, the\ndynamic cost decreases. On the other hand, the diversity score decreases for smaller , as we select\nmore representatives which become more redundant. The results in Figure 4 also demonstrate the\nrobustness of our method to the change of parameters.\n4.2 Summarization of Instructional Videos\nPeople learn how to perform tasks such as assembling a device or cooking a recipe, by watching\ninstructional videos for which there often exists a large amount of videos on the internet. Summa-\nrization of instructional videos helps to learn the grammars of tasks in terms of key activities or\nprocedures that need to be performed in order to do a certain task. On the other hand, there is a\nlogical way in which the key actions or procedures are connected together, hence, emphasizing the\nimportance of using the dynamic model of data when performing summarization.\nWe apply SeqFL to the task of summarization of intructional videos to automatically learn the\nsequence of key actions to perform a task. We use videos from the instructional video dataset [45],\nwhich consists of 30 instructional videos for each of \ufb01ve activities. The dataset also provides labels\nfor frames which contain the main steps required to perform that task. We preprocess the videos by\nsegmenting each video into superframes [46] and obtain features using a deep neural network that we\nhave constructed for feature extraction for summarization tasks. We use 60% of the videos from each\ntask as the training set to build an HMM model whose states form the source set, X. For each of the\n\n8\n\n\fSeqFL\n\nGive Compression\n\nCheck Breathing\n\nGive Breath Give Compression Give Breath Give Compression\n\nCheck Response\n\nGround\nTruth\nFigure 5: Ground-truth and the automatic summarization result of our method (SeqFL) for the task \u2018CPR\u2019.\n\nGive Breath Give Compression Give Breath Give Compression\n\nOpen Airway\n\nCheck Breathing\n\n1\n\nSeqFL\n\nGround\nTruth\n\nPut Soil\n\nAdd Top\n\nLoosen Root\n\nPlace Plant\n\nAdd Top\n\nPut Soil\n\nTap Pot\n\nTake Plant\n\nLoosen Root\n\nPlace Plant\n\nAdd Top\n\nFigure 6: Ground-truth and the summarization result of our method (SeqFL) for the task \u2018Repot a Plant\u2019.\n\n1\n\n40% remaining videos, we set Y to be the sequence of features extracted from the superframes of the\nvideo. Using the learned dynamic model, we apply our method to summarize each of these remaining\nvideos. The summary for each video is the set of representative elements of X, i.e., selected states\nfrom the HMM model. The assignments of representatives to superframes gives the ordering of\nrepresentatives, i.e., the ordering of performing key actions.\nFor evaluation, we map each representative state into an action label. To do so, we use the ground-truth\nlabels of the training videos, assigning a label to each representative state based on its \ufb01ve nearest\nneighbors in the training set. The summary for each video is an assignment of each superframe in\nthe video to one of the representative action labels. Since each video may have shown each action\nperformed for a different length of time, we remove consecutive repeated labels to form a list of\nactions performed, hence, removing the length of time each action was performed. To construct the\n\ufb01nal summary for each method for a given task, we align the lists of summary actions for all the test\nvideos using the alignment method of [45] for several number of slots. For each method, we choose\nthe number of HMM states and the number of slots for alignment that achieve the best performance.\nGiven ground-truth summaries, we compute the precision, recall and the F-score of various methods\n(see the supplementary materials for details). Table 1 shows the results. Notice that existing methods,\nwhich do not incorporate the dynamic of data for summarization, perform similar to each other for\nmost tasks. In particular, the results show that the sequential diversity promoted by Seq-kDPP and\nM-kDPP is not suf\ufb01cient for capturing the important steps of tasks. On the other hand, for most tasks\nand over the entire dataset, our method (SeqFL) signi\ufb01cantly outperforms other algorithms, better\nproducing the sequence of important steps to perform a task, thanks to the ability of our framework\nto incorporate the underlying dynamics of the data. Figures 5 and 6 show the ground-truth and the\nsummaries produced by our method for the tasks \u2018CPR\u2019 and \u2018Repot a Plant\u2019, respectively. Notice\nthat SeqFL suf\ufb01ciently well captures the main steps and the sequence of steps to perform these tasks.\nHowever, for each task, SeqFL does not capture two of the ground-truth steps. We believe this can be\novercome using larger datasets and more effective feature extraction methods for summarization.\n\n5 Conclusions and Future Work\nWe developed a new framework for sequential subset selection that takes advantage of the underlying\ndynamic models of data, promoting to select a set of representatives that are compatible according to\nthe dynamic models of data. By experiments on synthetic and real data, we showed the effectiveness\nof our method for summarization of sequential data. Our ongoing research include development of\nfast greedy algorithms for our sequential subset selection formulation, investigation of the theoretical\nguarantees of our method, as well as development of more effective summarization-based feature\nextraction techniques and working with larger datasets for the task of instructional data summarization.\n\n9\n\n\fAcknowledgements\nThis work is supported by NSF IIS-1657197 award and startup funds from the Northeastern University,\nCollege of Computer and Information Science.\n\nReferences\n[1] S. Garcia, J. Derrac, J. R. Cano, and F. Herrera, \u201cPrototype selection for nearest neighbor classi\ufb01cation:\nTaxonomy and empirical study,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34,\nno. 3, pp. 417\u2013435, 2012.\n\n[2] E. Elhamifar and M. C. De-Paolis-Kaluza, \u201cOnline summarization via submodular and convex optimization,\u201d\n\nin IEEE Conference on Computer Vision and Pattern Recognition, 2017.\n\n[3] B. Gong, W. Chao, K. Grauman, and F. Sha, \u201cDiverse sequential subset selection for supervised video\n\nsummarization,\u201d in Neural Information Processing Systems, 2014.\n\n[4] I. Simon, N. Snavely, and S. M. Seitz, \u201cScene summarization for online image collections,\u201d in IEEE\n\nInternational Conference on Computer Vision, 2007.\n\n[5] H. Lin and J. Bilmes, \u201cLearning mixtures of submodular shells with application to document summariza-\n\ntion,\u201d in Conference on Uncertainty in Arti\ufb01cial Intelligence, 2012.\n\n[6] A. Kulesza and B. Taskar, \u201cDeterminantal point processes for machine learning,\u201d Foundations and Trends\n\nin Machine Learning, vol. 5, 2012.\n\n[7] B. J. Frey and D. Dueck, \u201cClustering by passing messages between data points,\u201d Science, vol. 315, 2007.\n[8] E. Elhamifar, G. Sapiro, and S. S. Sastry, \u201cDissimilarity-based sparse subset selection,\u201d IEEE Transactions\n\non Pattern Analysis and Machine Intelligence, 2016.\n\n[9] E. Elhamifar, G. Sapiro, and R. Vidal, \u201cFinding exemplars from pairwise dissimilarities via simultaneous\n\nsparse recovery,\u201d Neural Information Processing Systems, 2012.\n\n[10] G. Kim, E. Xing, L. Fei-Fei, and T. Kanade, \u201cDistributed cosegmentation via submodular optimization on\n\nanisotropic diffusion,\u201d in International Conference on Computer Vision, 2011.\n\n[11] A. Shah and Z. Ghahramani, \u201cDeterminantal clustering process \u2013 a nonparametric bayesian approach to\n\nkernel based semi-supervised clustering,\u201d in Conference on Uncertainty in Arti\ufb01cial Intelligence, 2013.\n\n[12] R. Reichart and A. Korhonen, \u201cImproved lexical acquisition through dpp-based verb clustering,\u201d in\n\nConference of the Association for Computational Linguistics, 2013.\n\n[13] E. Elhamifar, S. Burden, and S. S. Sastry, \u201cAdaptive piecewise-af\ufb01ne inverse modeling of hybrid dynamical\n\nsystems,\u201d in World Congress of the International Federation of Automatic Control (IFAC), 2014.\n\n[14] E. Elhamifar and S. S. Sastry, \u201cEnergy disaggregation via learning \u2018powerlets\u2019 and sparse coding,\u201d in AAAI\n\nConference on Arti\ufb01cial Intelligence, 2015.\n\n[15] I. Guyon and A. Elisseeff, \u201cAn introduction to variable and feature selection,\u201d Journal of Machine Learning\n\nResearch, 2003.\n\n[16] I. Misra, A. Shrivastava, and M. Hebert, \u201cData-driven exemplar model selection,\u201d in Winter Conference on\n\nApplications of Computer Vision, 2014.\n\n[17] A. Krause, H. B. McMahan, C. Guestrin, and A. Gupta, \u201cRobust submodular observation selection,\u201d\n\nJournal of Machine Learning Research, vol. 9, 2008.\n\n[18] S. Joshi and S. Boyd, \u201cSensor selection via convex optimization,\u201d IEEE Transactions on Signal Processing,\n\nvol. 57, 2009.\n\n[19] J. Hartline, V. S. Mirrokni, and M. Sundararajan, \u201cOptimal marketing strategies over social networks,\u201d in\n\nWorld Wide Web Conference, 2008.\n\n[20] D. McSherry, \u201cDiversity-conscious retrieval,\u201d in Advances in Case-Based Reasoning, 2002.\n[21] R. Duda, P. Hart, and D. Stork, Pattern Classi\ufb01cation. Wiley-Interscience, October 2004.\n[22] M. Aharon, M. Elad, and A. M. Bruckstein, \u201cK-SVD: an algorithm for designing overcomplete dictionaries\n\nfor sparse representation,\u201d IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311\u20134322, 2006.\n\n[23] L. Rabiner, \u201cA tutorial on hidden markov models and selected applications in speech recognition,\u201d Pro-\n\nceedings of the IEEE, vol. 77, 1989.\n\n[24] F. Hadlock, \u201cFinding a maximum cut of a planar graph in polynomial time,\u201d SIAM Journal on Computing,\n\nvol. 4, 1975.\n\n[25] R. Motwani and P. Raghavan, \u201cRandomized algorithms,\u201d Cambridge University Press, New York, 1995.\n\n10\n\n\f[26] J. Carbonell and J. Goldstein, \u201cThe use of mmr, diversity-based reranking for reordering documents and\n\nproducing summaries,\u201d in SIGIR, 1998.\n\n[27] P. B. Mirchandani and R. L. Francis, Discrete Location Theory. Wiley, 1990.\n[28] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, \u201cAn analysis of approximations for maximizing\n\nsubmodular set functions,\u201d Mathematical Programming, vol. 14, 1978.\n\n[29] E. Elhamifar, G. Sapiro, and R. Vidal, \u201cSee all by looking at a few: Sparse modeling for \ufb01nding representa-\n\ntive objects,\u201d in IEEE Conference on Computer Vision and Pattern Recognition, 2012.\n\n[30] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, \u201cA convex model for non-negative matrix factorization\nand dimensionality reduction on physical space,\u201d IEEE Transactions on Image Processing, vol. 21, no. 7,\npp. 3239\u20133252, 2012.\n\n[31] A. Borodin and G. Olshanski, \u201cDistributions on partitions, point processes, and the hypergeometric kernel,\u201d\n\nCommunications in Mathematical Physics, vol. 211, 2000.\n\n[32] U. Feige, \u201cA threshold of ln n for approximating set cover,\u201d Journal of the ACM, 1998.\n[33] T. Gonzalez, \u201cClustering to minimize the maximum intercluster distance,\u201d Theoretical Computer Science,\n\nvol. 38, 1985.\n\n[34] A. Civril and M. Magdon-Ismail, \u201cOn selecting a maximum volume sub-matrix of a matrix and related\n\nproblems,\u201d Theoretical Computer Science, vol. 410, 2009.\n\n[35] P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, and R. Ward, \u201cRelax, no need to\nround: Integrality of clustering formulations,\u201d in Conference on Innovations in Theoretical Computer\nScience (ITCS), 2015.\n\n[36] A. Nellore and R. Ward, \u201cRecovery guarantees for exemplar-based clustering,\u201d in Information and Compu-\n\ntation, 2015.\n\n[37] R. H. Affandi, A. Kulesza, and E. B. Fox, \u201cMarkov determinantal point processes,\u201d in Conference on\n\nUncertainty in Arti\ufb01cial Intelligence, 2012.\n\n[38] S. Tschiatschek, A. Singla, and A. Krause, \u201cSelecting sequences of items via submodular maximization,\u201d\n\nAAAI, 2017.\n\n[39] Z. Ghahramani and M. I. Jordan, \u201cFactorial hidden markov models,\u201d Machine Learning, vol. 29, no. 2-3,\n\n1997.\n\n[40] Z. Ghahramani and S. Roweis, \u201cLearning nonlinear dynamical systems using an em algorithm,\u201d NIPS,\n\n2008.\n\n[41] C. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2007.\n[42] F. Kschischang, B. Frey, and H.-A. Loeliger, \u201cFactor graphs and the sum-product algorithm,\u201d IEEE\n\nTransactions on Information Theory, vol. 47, no. 2, pp. 498\u2013519, 2001.\n\n[43] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, \u201cDistributed optimization and statistical learning\nvia the alternating direction method of multipliers,\u201d Foundations and Trends in Machine Learning, vol. 3,\nno. 1, pp. 1\u2013122, 2010.\n\n[44] A. Kulesza and B. Taskar, \u201ck-dpps: Fixed-size determinantal point processes,\u201d in International Conference\n\non Machine Learning, 2011.\n\n[45] J.-B. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien, \u201cUnsupervised learning\n\nfrom narrated instruction videos,\u201d in Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[46] M. Gygli, H. Grabner, H. Riemenschneider, and L. V. Gool, \u201cCreating summaries from user videos,\u201d in\n\nEuropean Conference on Computer Vision, 2014.\n\n11\n\n\f", "award": [], "sourceid": 675, "authors": [{"given_name": "Ehsan", "family_name": "Elhamifar", "institution": "Northeastern University"}, {"given_name": "M. Clara", "family_name": "De Paolis Kaluza", "institution": "Northeastern University"}]}