{"title": "Deep Supervised Summarization: Algorithm and Application to Learning Instructions", "book": "Advances in Neural Information Processing Systems", "page_first": 1109, "page_last": 1120, "abstract": "We address the problem of finding representative points of datasets by learning from multiple datasets and their ground-truth summaries. We develop a supervised subset selection framework, based on the facility location utility function, which learns to map datasets to their ground-truth representatives. To do so, we propose to learn representations of data so that the input of transformed data to the facility location recovers their ground-truth representatives. Given the NP-hardness of the utility function, we consider its convex relaxation based on sparse representation and investigate conditions under which the solution of the convex optimization recovers ground-truth representatives of each dataset. We design a loss function whose minimization over the parameters of the data representation network leads to satisfying the theoretical conditions, hence guaranteeing recovering ground-truth summaries. Given the non-convexity of the loss function, we develop an efficient learning scheme that alternates between representation learning by minimizing our proposed loss given the current assignments of points to ground-truth representatives and updating assignments given the current data representation. 
By experiments on the problem of learning key-steps (subactivities) of instructional videos, we show that our proposed framework improves the state-of-the-art supervised subset selection algorithms.", "full_text": "Deep Supervised Summarization: Algorithm and Application to Learning Instructions\n\nChengguang Xu, Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, xu.cheng@husky.neu.edu\nEhsan Elhamifar, Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, eelhami@ccs.neu.edu\n\nAbstract\n\nWe address the problem of finding representative points of datasets by learning from multiple datasets and their ground-truth summaries. We develop a supervised subset selection framework, based on the facility location utility function, which learns to map datasets to their ground-truth representatives. To do so, we propose to learn representations of data so that the input of transformed data to the facility location recovers their ground-truth representatives. Given the NP-hardness of the utility function, we consider its convex relaxation based on sparse representation and investigate conditions under which the solution of the convex optimization recovers ground-truth representatives of each dataset. We design a loss function whose minimization over the parameters of the data representation network leads to satisfying the theoretical conditions, hence guaranteeing recovering ground-truth summaries. 
Given the non-convexity of the loss function, we develop an efficient learning scheme that alternates between representation learning by minimizing our proposed loss given the current assignments of points to ground-truth representatives and updating assignments given the current data representation. By experiments on the problem of learning key-steps (subactivities) of instructional videos, we show that our proposed framework improves the state-of-the-art supervised subset selection algorithms.\n\n1 Introduction\n\nSubset selection, which is the task of finding a small subset of the most informative points from a large dataset, is a fundamental machine learning task with many applications, including procedure learning [1, 2, 3], image, video, speech and document summarization [4, 5, 6, 7, 8, 9, 10, 11], data clustering [12, 13, 14, 15], feature and model selection [16, 17, 18, 19], social network marketing [20], product recommendation [21] and sensor placement [22, 23]. Subset selection involves the design and optimization of utility functions that characterize the informativeness of selected data points, referred to as representatives. Different criteria have been studied in the literature, including (sequential) facility location [24, 2, 1], maximum cut [25, 26], maximum marginal relevance [27], sparse coding [28, 29] and DPPs [11, 30, 31]. Given that almost all subset selection criteria are, in general, non-convex and NP-hard, approximate methods, such as greedy algorithms for optimizing graph-cuts and (sequential) facility location [24, 32, 2], sampling from Determinantal Point Processes (DPPs) [11, 31] and convex relaxation-based methods [12, 33, 29, 34, 35, 36], have been studied in the literature. Existing work on subset selection can be divided into the two main categories of unsupervised and supervised methods. 
The majority of existing research on subset selection falls into the unsupervised category, where one finds representatives of a dataset by optimizing the above criteria [5, 6, 7, 8, 9, 10, 11, 15, 22, 12, 28, 29] or others, such as diversity or coverage [37, 38, 39, 40, 41], importance\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n[42, 43, 5, 6, 44, 45, 46] and relevance [47, 39, 42, 48, 49]. The results are subsequently evaluated qualitatively or quantitatively against ground-truth representatives.\n\nSupervised Subset Selection. Humans perform remarkably well at summarizing video and speech data, e.g., describing the content of a long, complex video by a few sentences or by selecting a few frames/segments. This has motivated the development and study of supervised subset selection techniques that learn from humans, with the goal of bringing high-level reasoning and incorporating user preferences into subset selection. More formally, in the supervised setting, given datasets and their ground-truth representatives, one tries to train subset selection to recover the ground-truth summary of each training dataset and to generalize to new datasets.\n\nDespite its importance, supervised subset selection has only recently been studied in the literature [8, 50, 51, 52, 53, 54, 55, 56, 30]. One difficulty is that supervised subset selection cannot be naively treated as classification, since whether an item receives the label 'representative' or 'non-representative' depends on its relationships to the entire dataset. For example, a representative car image among images of cars, once placed in a dataset of face images, becomes non-representative. To address the problem, [50, 52] try to learn a combination of different criteria, i.e., weights of a mixture of submodular functions. 
However, deciding which submodular functions to combine, and how many, is a non-trivial problem that affects the performance. On the other hand, [8, 51, 53, 54, 30, 55] learn a DPP kernel or adapt it to test videos by maximizing the likelihood of the ground-truth summary under the DPP kernel. However, maximizing the likelihood of the ground-truth summary does not necessarily decrease the likelihood of non-ground-truth subsets.\n\nDeep Supervised Facility Location. In this paper, we address the problem of supervised subset selection based on representation learning for a convex relaxation of the uncapacitated facility location function. Facility location is a clustering-based subset selection that finds a set of representatives for which the sum of dissimilarities from every point to the closest representative is minimized [24, 32]. Given the NP-hardness of the problem, different approaches such as convex relaxation [57, 29, 12, 35, 36] and greedy submodular maximization [24, 32] have been proposed to efficiently optimize this utility function. We use the convex relaxation because of an appealing property that we exploit: we show conditions under which the sparse convex relaxation recovers ground-truth representatives. We use these conditions to design a loss function to learn a representation of the data so that inputting each transformed dataset to the facility location leads to finding the ground-truth representatives. Our loss function consists of three terms: a medoid loss that enforces each ground-truth representative to be the medoid of its associated cluster, an inter-cluster loss that ensures there is sufficient margin between points in different clusters induced by ground-truth representatives, and an intra-cluster loss that enforces the distances between points in each cluster to be smaller than a margin. 
The latter two loss functions are based on a margin that depends on the regularization parameter of the uncapacitated facility location and the number of points in the induced clusters. The conditions and our proposed loss function require knowing the clustering of the data based on assignments to ground-truth representatives. However, computing the assignments requires access to the optimal representation, which is not available. Thus, we propose an optimization scheme that alternates between updating the representation by minimizing our proposed loss given the current assignments of points to ground-truth representatives and updating the assignments given the current representation.\n\nWe perform experiments on the problem of supervised instructional video summarization, where each video consists of a set of key-steps (subactivities) needed to achieve a given task. In this case, each training video comes with a list of representative segments/frames, without knowing the labels of representatives and without knowing which representatives across different videos correspond to the same key-step (subactivity), making supervised subset selection far more challenging than classification. Our experiments on two large datasets, ProceL [1] and Breakfast [58], show the effectiveness of our framework.\n\nRemark 1 Our setting is different from interactive subset selection [59, 60], which incorporates human supervision interactively, i.e., as we run subset selection, we receive and incorporate human feedback to improve subset selection. In our case, we do not have a human in the loop interactively. Also, our setting is different from weakly supervised video summarization [61, 62], which uses the names of the video categories or additional web data to perform summarization. We assume each dataset has a ground-truth summary and do not use additional web data. Finally, [63] uses facility location for metric learning. 
However, this requires knowledge about the assignments of points to predefined categories, which is a stronger requirement than only knowing the ground-truth representatives.\n\nRemark 2 To the best of our knowledge, this is the first work on supervised subset selection that derives conditions for the exactness of a subset selection utility function (i.e., conditions under which subset selection recovers ground-truth representatives) and employs these conditions to design a loss function for representation learning, e.g., via DNNs. In fact, this work takes a major step towards a theoretically motivated supervised subset selection framework.\n\nPaper Organization. The paper is organized as follows. In Section 2, we review the facility location and its convex relaxation for solving subset selection efficiently. In Section 3, we show conditions for the equivalence of the two problems, design a new loss function for representation learning whose minimum satisfies the conditions, hence guaranteeing that we obtain ground-truth representatives, and propose an efficient learning algorithm. In Section 4, we show experimental results on the ProceL and Breakfast datasets for instructional video summarization. Finally, Section 5 concludes the paper.\n\n2 Background on Subset Selection\n\nFacility Location. Facility location is a clustering-based subset selection utility function, in which each point is assigned to one representative, hence performing both representative selection and clustering [24]. More specifically, assume we have a dataset Y = {y_1, ..., y_N} consisting of N points, for which we are given dissimilarities between pairs of points. Let d_{i,j} = d(y_i, y_j) denote the dissimilarity between points y_i and y_j, with d(·,·) being the dissimilarity function. The smaller d_{i,j} is, the better y_i represents y_j. 
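As a toy illustration (ours, not the paper's code), a Euclidean dissimilarity matrix with the properties assumed below, non-negativity and d_{jj} = 0 < d_{ij} for distinct points, can be built directly:

```python
import math

def euclidean_dissimilarities(points):
    """Pairwise dissimilarities d[i][j] = ||y_i - y_j||_2.

    Non-negative, with d[j][j] = 0 < d[i][j] for i != j whenever the
    points are distinct, matching the assumptions stated in the text.
    """
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

# Tiny example with three 2-D points.
Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
D = euclidean_dissimilarities(Y)
```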
We assume that dissimilarities are non-negative, provide a partial ordering of the data, and satisfy d_{jj} < d_{ij} for every i ≠ j. In order to find representatives, the facility location selects a subset S ⊆ {1, ..., N} of the data points and assigns each point in Y to the representative point in S with minimum dissimilarity. In particular, the uncapacitated facility location [64, 65] tries to find a subset S with a sufficiently small cardinality that gives the best encoding of the dataset, i.e.,\n\nmin_{S⊆{1,...,N}}  λ|S| + ∑_{j=1}^{N} min_{i∈S} d_{ij},   (1)\n\nwhere λ ≥ 0 is a regularization parameter that sets a trade-off between the number of representatives, |S|, and the encoding quality via S. When λ is zero, every point will be a representative of itself.\n\nSparse Convex Relaxation. Optimizing the facility location in (1) is NP-hard, as it requires searching over all possible subsets of the dataset. This has motivated efficient algorithms, including forward-backward greedy submodular maximization with worst-case performance guarantees [66] as well as sparse convex relaxation [12]. To obtain the convex relaxation, which we use in the paper, one first defines assignment variables z_{ij} ∈ {0, 1}, where z_{ij} is 1 when y_j is represented by y_i and is zero otherwise. We can rewrite (1) as an equivalent optimization on the assignment variables as\n\nmin_{z_{ij}}  λ ∑_{i=1}^{N} I(‖[z_{i1} ⋯ z_{iN}]‖_p) + ∑_{i,j=1}^{N} d_{ij} z_{ij}   s.t.  z_{ij} ∈ {0, 1} ∀ i, j,  ∑_{i=1}^{N} z_{ij} = 1 ∀ j,   (2)\n\nwhere I(·) is an indicator function, which is one when its argument is nonzero and is zero otherwise. Thus, the first term of the objective function measures the number of representatives, since [z_{i1} ⋯ z_{iN}] is nonzero when y_i represents some of the data points and is zero otherwise. The second term measures the encoding cost, while the constraints ensure that each point is represented by exactly one representative. Notice that (2), which is equivalent to (1), is still an NP-hard problem. Also, (2) is a group-sparse optimization where, ideally, a few vectors [z_{i1} ⋯ z_{iN}] must be nonzero for the few i's that correspond to the representative points. To obtain an efficient convex relaxation based on group-sparsity (for p ≥ 1) [12, 29], we drop the indicator function and relax the binary constraints to z_{ij} ∈ [0, 1], hence we solve\n\nmin_{z_{ij}}  λ ∑_{i=1}^{N} ‖[z_{i1} ⋯ z_{iN}]‖_p + ∑_{i,j=1}^{N} d_{ij} z_{ij}   s.t.  z_{ij} ≥ 0 ∀ i, j,  ∑_{i=1}^{N} z_{ij} = 1 ∀ j.   (3)\n\nWe then obtain the set of representatives R as the points y_i for which z_{ij} is nonzero for some j. Moreover, we obtain a clustering of the data according to the assignments of points to representatives, where for every representative i ∈ R, we obtain its cluster G_i = {j ∈ {1, ..., N} | z_{ij} = 1} as the set of all points assigned to i.\n\n3 Supervised Facility Location\n\nIn this section, we present our proposed approach for supervised subset selection. We discuss conditions under which (3), which is the practical and efficient algorithm for solving the uncapacitated facility location, recovers ground-truth representatives from datasets. 
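To fix ideas before stating the conditions, here is a brute-force sketch of the combinatorial problem (1) (our illustration, feasible only for tiny N; the paper instead uses the relaxation (3) or greedy methods):

```python
from itertools import combinations

def uncapacitated_facility_location(D, lam):
    """Brute-force minimizer of (1): lam*|S| + sum_j min_{i in S} d_ij.

    D is an N x N dissimilarity matrix and lam >= 0. Exponential in N,
    so only usable on toy problems.
    """
    n = len(D)
    best_cost, best_S = float("inf"), None
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            cost = lam * len(S) + sum(min(D[i][j] for i in S) for j in range(n))
            if cost < best_cost:
                best_cost, best_S = cost, set(S)
    # Induced clustering: each point goes to its closest representative.
    clusters = {i: [] for i in best_S}
    for j in range(n):
        clusters[min(best_S, key=lambda i: D[i][j])].append(j)
    return best_S, clusters, best_cost
```

On four points on a line forming two tight groups, e.g. 0.0, 0.1, 5.0, 5.1 with λ = 0.5, the minimizer picks one representative per group, illustrating the trade-off λ controls.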
We use these conditions to design a loss function for representation learning so that, for the transformed data obtained by minimizing the loss, (3) and equivalently (1) will select the ground-truth summaries of the training datasets. We then present an efficient learning framework to optimize our proposed loss function.\n\n3.1 Problem Setting\n\nAssume we have L datasets and their ground-truth representatives, {(Y_ℓ, R_ℓ)}_{ℓ=1}^{L}, where Y_ℓ = {y_{ℓ,1}, ..., y_{ℓ,N_ℓ}} denotes the N_ℓ data points in the ℓ-th dataset and R_ℓ ⊆ {1, ..., N_ℓ} denotes the associated set of indices of ground-truth representatives. The goal of supervised subset selection is to train a subset selection method so that the input of each dataset Y_ℓ to the trained model leads to obtaining the ground-truth representatives, R_ℓ. In the paper, we fix the subset selection method to the uncapacitated facility location in (1) and consider p = ∞ in (2) and (3). We cast the supervised subset selection problem as learning a transformation f_Θ(·) on the input data so that running the convex algorithm (3) on f_Θ(Y_ℓ) leads to obtaining R_ℓ. We use a deep neural network, parametrized by Θ, for representation learning and use the Euclidean distance as the measure of dissimilarity, i.e., we define\n\nd^ℓ_{i,j} := ‖f_Θ(y_{ℓ,i}) − f_Θ(y_{ℓ,j})‖_2.   (4)\n\nNotice that we can use other dissimilarities as well (the theory and learning algorithm below work for other dissimilarities); however, the Euclidean distance results in obtaining an embedding space where points are gathered around ground-truth representatives according to ℓ2 distances. To learn the
To learn the\nparameters \u0398, we design a loss function using conditions that guarantee the performance of (3) for\nobtaining ground-truth representatives across datasets.\n\n.\n\n(cid:44)(cid:13)(cid:13)f\u0398(y(cid:96),i) \u2212 f\u0398(y(cid:96),j)(cid:13)(cid:13)2\n\n3.2 Proposed Learning Framework\n\nWe investigate conditions under which the convex algorithm in (3) recovers a given set of points as\nrepresentatives of transformed data {f\u0398(y(cid:96),1), . . . , f\u0398(y(cid:96),N(cid:96)\n)}. We show that under these conditions,\nthe solution of the convex algorithm in (3), which has the constraint zi,j \u2208 [0, 1], will be integer. As a\nresult, the convex relaxation will recover the same solution of the NP-hard non-convex uncapacitated\nfacility location, i.e., the optimality gap between the non-convex and convex formulations vanishes.\nWe then use these conditions to design a loss function for learning the representation parameters \u0398.\nTheorem 1 Consider the convex relaxation of the uncapacitated facility location in (3), with a\n\ufb01xed \u03bb and p = \u221e. Let R(cid:96) be the set of ground-truth representatives from the (cid:96)-th dataset\ni denote the cluster associated with the representative i \u2208 R(cid:96),\n{f\u0398(y(cid:96),1), . . . , f\u0398(y(cid:96),N(cid:96)\ni.e.,\n(5)\nThe optimization (3) recovers R(cid:96) as the set of representatives, if the following conditions hold:\n\ni(cid:48),j = argmini(cid:48) (cid:107)f\u0398(y(cid:96),i(cid:48)) \u2212 f\u0398(y(cid:96),j)(cid:107)2\n\n(cid:9).\n\nG(cid:96)\n\n)} and let G(cid:96)\n\ni =(cid:8)j | i = argmini(cid:48) d(cid:96)\ni , we have(cid:80)\n\ni,j \u2264(cid:80)\n\n1. \u2200i \u2208 R(cid:96), \u2200i(cid:48) \u2208 G(cid:96)\n2. \u2200i \u2208 R(cid:96), \u2200j \u2208 G(cid:96)\n3. 
\u2200i \u2208 R(cid:96), \u2200i(cid:48), j \u2208 G(cid:96)\n\ni , \u2200i(cid:48) /\u2208 G(cid:96)\n\ni , we have \u03bb|G(cid:96)\ni(cid:48),j \u2264 \u03bb|G(cid:96)\n\ni | + d(cid:96)\ni | + d(cid:96)\n\ni,j.\n\ni , we have d(cid:96)\n\nj\u2208G(cid:96)\n\ni\n\nd(cid:96)\n\nj\u2208G(cid:96)\n\ni\n\nd(cid:96)\ni(cid:48),j;\n\ni,j < d(cid:96)\n\ni(cid:48),j;\n\nThe \ufb01rst condition (medoid condition) states that for points assigned to the cluster of i \u2208 R(cid:96), the\nrepresentative point i must achieve the minimum encoding cost. The second condition (inter-cluster\ncondition) states that the closest point to each cluster from other groups must be suf\ufb01ciently far from\nit. The third condition (intra-cluster condition) states that points in the same cluster must not be far\nfrom each other. For both the inter and intra cluster conditions, the separation margin is given by\n\u03bb/|G(cid:96)\ni|, depending on the regularization parameter and the number of points in each cluster, i.e., we\nhave an adaptive margin to each cluster. Under the conditions of the Theorem 1, we can show that\nthere is no gap between the NP-hard non-convex formulation in (1) and its convex relaxation in (3).\n\n4\n\n\f(cid:96)=1 and ground truth representatives {R(cid:96)}L\n\nAlgorithm 1 : Supervised Facility Location Learning\nInput: Datasets {Y(cid:96)}L\n1: Initialize \u0398 by using a pretrained network;\n2: while (Not Converged) do\nFor \ufb01xed \u0398, compute G(cid:96)\n3:\n2, . . .}L\nFor \ufb01xed {G(cid:96)\n4:\n5: end while\nOutput: Optimal parameters \u0398.\n\n1,G(cid:96)\n\n1,G(cid:96)\n(cid:96)=1, update \u0398 by minimizing the loss function (7);\n\n2, . . . 
for each dataset (cid:96) via (5);\n\n(cid:96)=1.\n\nCorollary 1 Under the assumptions of the Theorem 1, the convex relaxation in (3) is equivalent to\nthe non-convex uncapacitated facility location optimization in (1), both recovering the same integer\nsolution, where for Y(cid:96), we recover R(cid:96) as the representative set.\nWe can also show similar results for p = 2 (see the supplementary \ufb01le). Next, we use the above result\nto design a loss function for supervised subset selection using the uncapacitated facility location. In\nfact, if we \ufb01nd a representation \u0398 using which the conditions of the Theorem 1 are satis\ufb01ed, then\nnot only the combinatorial optimization in (1) recovers the ground-truth representatives from each\ndataset, but also we obtain the same solution using the ef\ufb01cient sparse optimization in (3). To \ufb01nd the\ndesired \u0398, we propose a loss function that penalizes violation of the conditions of the Theorem 1.\nMore speci\ufb01cally, we de\ufb01ne three loss functions corresponding to the conditions of the theorem, as\n\nL(cid:96)\n\nmedoid(\u0398) (cid:44) (cid:88)\ninter(\u0398) (cid:44) (cid:88)\nintra(\u0398) (cid:44) (cid:88)\n\nL(cid:96)\n\nL(cid:96)\n\ni\u2208R(cid:96)\n\ni\u2208R(cid:96)\n\ni(cid:48)\u2208G(cid:96)\n\n(cid:88)\n(cid:88)\n(cid:88)\n\nj\u2208G(cid:96)\n\ni\n\ni\n\ni\n\nd(cid:96)\n\n(cid:1)\n\nj\u2208G(cid:96)\n\nd(cid:96)\ni(cid:48),j\n\ni,j \u2212 (cid:88)\n(cid:0)(cid:88)\n(cid:88)\n(cid:0) \u03bb\nj\u2208G(cid:96)\ni,j \u2212 d(cid:96)\ni| + d(cid:96)\ni(cid:48),j\n|G(cid:96)\n(cid:0)d(cid:96)\n\ni(cid:48),j \u2212 d(cid:96)\n\ni,j \u2212 \u03bb\ni|\n|G(cid:96)\n\n(cid:1)\n\n+\n\ni\n\ni\n\ni(cid:48) /\u2208G(cid:96)\n\n,\n\n+\n\n,\n\n(cid:1)\n\ni\u2208R(cid:96)\n\ni(cid:48),j\u2208G(cid:96)\n\n,\n\n(6)\n\n+\n\n2,L(cid:96)\nwhere (x)+ (cid:44) max{0, x} is the non-negative thresholding (or ReLU) operator, and L(cid:96)\n3\nmeasure and penalize violation of the medoid, inter-cluster and 
intra-cluster conditions, respectively,\nfor the dataset (cid:96). Putting the three loss functions together, we propose to minimize the following cost\nfunction, de\ufb01ned over the L datasets,\n\n1,L(cid:96)\n\ni\n\nmedoid(\u0398) + \u03c1interL(cid:96)\n\ninter(\u0398) + \u03c1intraL(cid:96)\n\n(7)\n\nintra(\u0398)(cid:1),\n\nL(\u0398) (cid:44) L(cid:88)\n\n(cid:0)L(cid:96)\n\n(cid:96)=1\n\nmin\n\n\u0398\n\nwhere \u03c1inter, \u03c1intra \u2265 0 are regularization parameters that set a trade-off between the three terms.\nTo minimize L, we need to use the clustering of points in every dataset Y(cid:96) according to assignments of\npoints to the ground-truth representative set R(cid:96), which requires computing G(cid:96)\ni \u2019s. However, computing\nsuch clustering via (5) requires knowledge of the optimal representation of the data \u0398, which is\nnot available. To address the problem, we propose an ef\ufb01cient learning algorithm that alternates\nbetween updating the representation parameters \u0398 by minimizing the proposed loss given the current\nassignments of points to ground-truth representatives and updating the assignments given the current\nrepresentation. Algorithm 1 shows the steps of our learning algorithm.\nNotice that the loss functions naively require considering every representative and every pair of points\nin the same or different clusters. Given the redundancy of points, this is not often needed and we can\nonly sample a few pairs of points in the same or different clusters to compute each loss.\n\nAdaptive Margin. It is important to note that our derived conditions and the loss function make use\nof a margin \u03bb/|Gi| that depends on the facility location hyperparameter and the number of points in\neach cluster Gi. In other words, the margin would be different for different clusters during different\niterations of our learning scheme. 
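As an unofficial plain-Python sketch, the penalties in (6) and the assignment step (5) they rely on can be written directly from a distance matrix; `D`, `reps`, `clusters` and `lam` mirror d^ℓ, R_ℓ, G^ℓ_i and λ:

```python
def relu(x):
    return max(0.0, x)

def assign_clusters(D, reps):
    """Assignment step (5): each point joins its closest representative."""
    n = len(D)
    clusters = {i: [] for i in reps}
    for j in range(n):
        clusters[min(reps, key=lambda i: D[i][j])].append(j)
    return clusters

def supervised_fl_losses(D, reps, clusters, lam):
    """Hinge penalties of (6) for one dataset, given pairwise distances
    D[i][j], ground-truth representatives `reps`, and the current
    clusters {i: G_i} from `assign_clusters`."""
    L_medoid = L_inter = L_intra = 0.0
    n = len(D)
    for i in reps:
        G = clusters[i]
        margin = lam / len(G)  # adaptive margin lambda / |G_i|
        # Medoid condition: i must be the cheapest encoder of its cluster.
        for i2 in G:
            L_medoid += relu(sum(D[i][j] for j in G) - sum(D[i2][j] for j in G))
        for j in G:
            # Inter-cluster: points outside G_i must stay margin-far from j.
            for i2 in (k for k in range(n) if k not in G):
                L_inter += relu(margin + D[i][j] - D[i2][j])
            # Intra-cluster: points inside G_i must stay margin-close to j.
            for i2 in G:
                L_intra += relu(D[i2][j] - D[i][j] - margin)
    return L_medoid, L_inter, L_intra
```

In the alternating scheme, `assign_clusters` plays the role of step 3 of Algorithm 1 on the current embedding, while a gradient step on the weighted sum of these three penalties plays the role of step 4.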
More specifically, for a representative that has few points assigned to it, the size of the cluster is small, hence incurring a larger margin than for clusters with more points. This has the following effect: when a cluster has a small number of points, it can be considered under-sampled; hence, to generalize better to test data, we need a better separation from other clusters, i.e., a larger margin. On the other hand, for a cluster with a large number of points, the margin can be smaller, as the chance of changing the distances between and within clusters by adding more samples to it is low. This is in contrast to contrastive loss functions, which use a fixed margin for pairs of dissimilar items while reducing the distances of similar items as much as possible. Another difference with respect to contrastive loss functions is that in our loss, we compare the encoding quality of each representative point to non-representative points, whereas in a contrastive loss, one uses all pairs of similar and dissimilar items.\n\nRemark 3 While [35, 36] have shown the integrality of the convex relaxation for cardinality-constrained facility location, we showed equivalence conditions for the uncapacitated problem. Moreover, the nature of our conditions, as opposed to asymptotic results, allowed us to design the loss in (7). Also, we learn to effectively use a common λ across different datasets, which cannot be done in the cardinality-constrained case, where the number of ground-truth representatives is already given.\n\n4 Experiments\n\nIn this section, we evaluate the performance of our method, which we refer to as Supervised Facility Location (SupFL), as well as other algorithms, for learning key-steps (subactivities) of instructional videos by learning from ground-truth summaries. 
Notice that each training dataset comes with a list of representative segments/frames, without knowing the labels of representatives and without knowing which representatives across different videos correspond to the same key-step (subactivity). This makes supervised subset selection different from, and far more challenging than, classification.\n\n4.1 Experimental Setting\n\nDataset. We perform experiments on the ProceL [1] and Breakfast [58] datasets. ProceL is a large multimodal dataset of 12 diverse tasks, such as 'install Chromecast', 'assemble Clarinet' and 'perform CPR'. Each task consists of about 60 videos and has a grammar of key-steps; e.g., 'perform CPR' consists of 'call emergency', 'check pulse', 'open airway', 'give compression' and 'give breath'. Each video is annotated with the key-steps. Breakfast is another large dataset of 10 cooking activities performed by 52 individuals in 18 different kitchens. The videos are captured using multiple cameras with different viewpoints. Each activity has approximately 200 videos, corresponding to different views of each person doing the same task, hence a total of 1,989 videos in the dataset. Similar to ProceL, each task consists of multiple key-steps (subactivities) required to achieve the task. For example, 'making cereal' consists of 'take a bowl', 'pour cereals', 'pour milk', 'stir cereals' and 'sil' (for background frames at the beginning and the end).\n\nFor the experiments on ProceL, we split the videos of each task into 70% for training, 15% for validation and 15% for testing. For Breakfast, we split the videos of each activity into 60% for training, 20% for validation and 20% for testing. We use the middle segment of each subactivity as the ground-truth representative.\n\nFeature Extraction and Learning. 
Given the similarity of consecutive frames, we perform subset selection at the segment level. For ProceL, we use the segments provided in the dataset, and for Breakfast, we divide each video into segments of 16-frame length with an 8-frame overlap between two consecutive segments. We use the C3D network [67] for feature extraction in each segment and use the 4,096-dimensional feature obtained by the first dense layer after the convolutional layers. We consider two variants of our method: i) SupFL(L), where we learn a linear transformation on the C3D features; ii) SupFL(N), where we learn the parameters of a neural network applied to the C3D features. We use the Euclidean distance for pairwise dissimilarities.\n\nAlgorithms and Baselines. We compare the two variants of our method, SupFL(L) and SupFL(N), discussed above, against SubmodMix [52], which learns the weights of a mixture of submodular functions; dppLSTM [54], which learns to select representatives using a bidirectional LSTM combined with the DPP kernel; and FCSN [68], which learns the weights of a fully convolutional network by treating subset selection as classification of each segment into representative vs. non-representative. To show the effectiveness of learning, we also compare with two unsupervised baselines: Uniform, which selects representatives uniformly at random from all segments, and UFL, which corresponds to running the uncapacitated facility location via the forward-backward greedy method on dissimilarities computed from C3D features. This particularly allows us to investigate the effectiveness of our method in taking advantage of ground-truth summaries.\n\nEvaluation metric. Following [58], we report the segment-wise precision (P), action-wise recall (R) and F1 score (F). 
These metrics help to measure the performance of finding a representative for each key-step and the correctness of the video segmentation based on the assignments of segments to representatives. More specifically, for a video with N_s segments and N_a ground-truth key-steps, after running subset selection we assign each segment to its recovered representative. We compute\n\nP = N̂_s / N_s,   R = N̂_a / N_a,   F = 2 P R / (P + R),   (8)\n\nwhere N̂_s is the number of segments that are correctly assigned to representatives, given the ground-truth assignment labels, and N̂_a is the number of key-steps in the video recovered via representatives. The F1 score is the harmonic mean of the segment-wise precision and action-wise recall, which is between 0 and 1. We report the average of each score over the videos of each task.\n\nActivity | Uniform | UFL | dppLSTM | SubmodMix | FCSN | SupFL(L) | SupFL(N)\nperform CPR | 55.7 | 59.7 | 53.4 | 60.0 | 57.4 | 63.7 | 64.9\nmake coffee | 57.3 | 62.6 | 56.8 | 62.3 | 64.2 | 71.5 | 71.6\njump-start car | 57.2 | 66.0 | 55.8 | 67.2 | 69.6 | 68.5 | 71.4\nrepot plant | 59.6 | 67.3 | 64.7 | 68.2 | 69.2 | 69.7 | 69.1\nchange tire | 54.6 | 68.4 | 57.3 | 65.5 | 65.7 | 71.0 | 71.2\ntie a tie | 44.6 | 51.6 | 48.1 | 53.5 | 60.2 | 58.5 | 60.0\nsetup Chromecast | 52.6 | 61.7 | 55.5 | 61.8 | 56.8 | 63.7 | 66.0\nchange iPhone battery | 53.0 | 55.9 | 53.4 | 61.2 | 59.3 | 62.3 | 63.2\nmake pbj sandwich | 52.7 | 60.8 | 53.2 | 58.0 | 62.0 | 64.9 | 64.2\nmake smoke salmon | 59.9 | 69.4 | 62.6 | 71.4 | 65.3 | 72.8 | 74.3\nchange toilet seat | 55.5 | 61.9 | 56.5 | 62.7 | 68.4 | 66.0 | 67.5\nassemble clarinet | 57.8 | 67.2 | 61.7 | 66.0 | 67.8 | 72.0 | 70.5\nAverage | 55.0 | 62.7 | 56.6 | 63.2 | 63.8 | 67.0 | 67.8\nTable 1: Average F1 score (%) of different algorithms for subset selection on the ProceL dataset.\n\nImplementation details. We implemented our framework in PyTorch and used the ADMM framework in [12] for subset selection via UFL and our SupFL. 
We train a model for each individual activity. For SupFL(L), we set the dimension of the transformed data to 1000 and 500 for ProceL and Breakfast, respectively, while for SupFL(N) we set the dimensions of the network to 4096×1000×1000 and 4096×1000×500 for ProceL and Breakfast, respectively, where we use ReLU activations for the second layer. We use stochastic gradient descent to train our model, with 5 videos in each minibatch. We use the Adam optimizer with a learning rate of 1e-4 and weight decay of 5e-4. We train our model for at most 50 epochs. To improve the training time, after we compute assignments of points to each representative in our alternating algorithm, we randomly sample 10 points from each group and use them to form the loss functions in (6). Our method has three hyperparameters (λ, ρ_inter, ρ_intra), where λ is the regularization parameter of the UFL in (3), while ρ_inter and ρ_intra are regularization parameters of our loss function in (7). We set the values of the hyperparameters using the validation set (we did not perform heavy hyperparameter tuning). In the experiments, we show the effect of the regularization parameters on the performance. To have a fair comparison, we run all methods to select the same number of representatives as the number of ground-truth key-steps in the grammar of the task.

4.2 Experimental Results

Table 1 shows the average F1 score (%) of different methods on each task in the ProceL dataset. Notice that our method outperforms the other algorithms, obtaining 67.8% and 67.0% F1 score via SupFL(N) and SupFL(L), respectively, over the entire dataset. Compared to UFL, which is the unsupervised version of our framework, we obtain significant improvement in all tasks, e.g., improving the F1 score by 8.4% and 7.3% for 'tie a tie' and 'change iPhone battery', respectively.
dppLSTM, which is supervised, does not do as well as our method and the other two supervised algorithms. This comes from the fact that dppLSTM often selects multiple segments from one key-step and from the background, due to their appearance diversity, while missing some of the key-steps to choose segments from (see Figure 3). While SubmodMix and FCSN perform better than the other baselines, their overall performance is about 4% lower than our method's. This comes from the fact that SubmodMix has limited learning capacity, depending on which functions are included in the mixture, while FCSN treats supervised subset selection as classification, hence embeds ground-truth representative segments (class 1) close to each other and far from non-representative segments (class 0), which is not desired, as a representative and a non-representative segment could be very similar.

Activity       Uniform  UFL   dppLSTM  SubmodMix  SupFL(L)  SupFL(N)
cereals        58.6     63.8  58.3     64.6       66.3      63.4
coffee         73.9     77.7  78.1     79.5       82.6      80.5
friedegg       55.2     53.8  61.2     53.4       54.9      59.7
juice          61.8     67.9  65.6     67.7       72.9      71.9
milk           55.3     63.4  54.9     63.1       65.8      63.9
pancake        53.1     53.6  41.0     54.1       51.5      53.3
salad          57.5     60.5  59.3     59.4       64.5      61.2
sandwich       60.2     65.6  61.7     65.0       69.1      67.0
scrambledegg   56.8     61.9  57.9     61.6       63.6      59.6
tea            69.2     76.8  72.6     76.1       78.1      76.3
Average        60.2     64.5  61.1     64.4       66.9      65.7
Table 2: Average F1 score (%) of different algorithms for subset selection on the Breakfast dataset.

Figure 2: Effect of hyperparameters on the average F1 score (%) over all tasks in the ProceL dataset.

Table 2 shows the average F1 score (%) in the Breakfast dataset¹. While both versions of our method outperform the other algorithms, in contrast to ProceL, SupFL(L) generally does better than SupFL(N). Moreover, the gap between the performance of UFL and SupFL is smaller.
This comes from the fact that the C3D features already capture discriminative information for separating different key-steps (subactivities); hence, learning a linear transformation generally does better than a nonlinear one, and less improvement is expected from learning from ground-truth summaries.

Figure 1 shows the average F1 score improvement over not learning the data representation on the test videos of the four tasks 'perform CPR', 'change iPhone battery', 'make coffee' and 'change tire' in ProceL, as a function of the number of training epochs. Notice that, generally, as training continues the F1 score improves, obtaining between 4% and 10% improvement over using raw C3D features, depending on the task.

Hyperparameter Effect. We also analyze the performance of our method as a function of the regularization parameters (λ, ρ_inter, ρ_intra), where λ corresponds to the regularization parameter of the uncapacitated FL utility function in (3), while ρ_inter and ρ_intra are the hyperparameters that set a trade-off between the three terms of our loss function in (7). Figure 2 shows the F1 score on the ProceL dataset, where, to see the effect of each hyperparameter, we have fixed the values of the other two (these fixed values depend on the task). Notice that the F1 score is relatively stable with respect to hyperparameter changes. In particular, changing λ from 0.001 to 0.1 changes the performance over the dataset by at most 1.2% in F1 score, while changing ρ_inter and ρ_intra from 0.01 to 10 changes the performance by at most 0.6% and 2.1%, respectively.

Ablation Studies. To show the effectiveness of using all three loss functions in our proposed cost function in (7), we perform ablation studies. Table 3 shows the average precision, recall and F1 scores on the ProceL dataset.
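To make the role of the three terms concrete, the sketch below combines a medoid term, an inter-cluster term, and an intra-cluster term weighted by ρ_inter and ρ_intra. The exact form of each term in (7) is defined in the paper; the terms here are plausible illustrative stand-ins (pull points toward their ground-truth representative, tighten each group, and push representatives apart), not the actual loss.

```python
import numpy as np

def total_loss(Z, medoid_idx, labels, rho_inter=1.0, rho_intra=1.0):
    """Illustrative three-term loss on transformed features Z (n x d).

    medoid_idx[k]: index in Z of the ground-truth representative of group k.
    labels[i]:     group (key-step) index of point i.
    """
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    meds = Z[np.asarray(medoid_idx)]         # one representative per group
    # medoid term: squared distance of each point to its group's representative
    l_medoid = np.mean(np.sum((Z - meds[labels]) ** 2, axis=1))
    # intra-cluster term: spread of each group around its mean
    l_intra = np.mean([np.sum((Z[labels == k] - Z[labels == k].mean(0)) ** 2)
                       for k in np.unique(labels)])
    # inter-cluster term: hinge encouraging representative separation (margin 1)
    d2 = np.sum((meds[:, None, :] - meds[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(meds), k=1)
    l_inter = np.mean(np.maximum(0.0, 1.0 - d2[iu])) if len(iu[0]) else 0.0
    return l_medoid + rho_inter * l_inter + rho_intra * l_intra
```

Dropping any one term, as in the ablation, amounts to setting its weight to zero in this sketch.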
Notice that when we use only one loss or a combination of two loss functions, we achieve relatively similar low scores, about 7% lower than using all three loss functions together. This shows that, as expected from the theoretical results, we need to use all loss functions corresponding to the three theoretical conditions in order to effectively learn from ground-truth summaries. Also, notice that the medoid loss alone, or its combination with either of the two other losses, obtains slightly better performance than using the inter-cluster or intra-cluster loss or their combination. This is expected, as the medoid loss tries to center points around each ground-truth representative. Finally, the combination of the inter-cluster and intra-cluster losses, which has a weak resemblance to the contrastive loss, does not do well in the supervised subset selection problem.

Figure 1: F1 improvement during training on test videos from four tasks in ProceL.

Figure 3: Qualitative results on two test videos from the tasks 'make smoke salmon sandwich' (left) and 'change iPhone battery' (right). Compared to the baselines, our method recovers more representatives corresponding to ground-truth key-steps.

SupFL                                        Precision  Recall  F1 score
medoid loss                                  61.4       68.1    61.6
inter-cluster loss                           60.2       66.2    59.9
intra-cluster loss                           57.5       67.2    59.3
medoid + inter-cluster loss                  60.6       68.2    61.2
medoid + intra-cluster loss                  60.4       68.4    61.8
inter-cluster + intra-cluster loss           57.5       64.7    58.2
medoid + inter-cluster + intra-cluster loss  66.3       72.8    67.8
Table 3: Average performance of our method, SupFL(N), on ProceL with different combinations of loss functions.

¹FCSN on Breakfast produced significantly lower F1 scores compared to all other baselines.

Qualitative Results.
Figure 3 shows qualitative results of running the different methods on two videos from the tasks 'change iPhone battery' and 'make smoke salmon sandwich' from the ProceL dataset, where all methods choose the same number of representatives (for clarity, we do not show representatives obtained from the background). Notice that for 'make smoke salmon sandwich' our method correctly finds representatives from all key-steps, while the other methods miss one of the key-steps. Similarly, for 'change iPhone battery', our method is more successful than the baselines, which miss 5 or 6 key-steps. Our method in general does better at obtaining diverse representative segments, while other supervised baselines often obtain multiple redundant representatives from the same key-step.

5 Conclusions

We addressed the problem of supervised subset selection by generalizing the facility location to learn from ground-truth summaries. We considered an efficient sparse optimization of the uncapacitated facility location and investigated conditions under which it recovers ground-truth representatives and also becomes equivalent to the original NP-hard problem. We designed a loss function and an efficient framework to learn representations of data so that the input of transformed data to the facility location satisfies the theoretical conditions, hence recovers ground-truth summaries. We showed the effectiveness of our method for recovering key-steps of instructional videos. To the best of our knowledge, this is the first work on supervised subset selection that derives conditions under which subset selection recovers ground-truth representatives and employs them to design a loss function for deep representation learning.
We believe that this work took a major step towards a theoretically motivated supervised subset selection framework.

Acknowledgements

This work is supported by DARPA Young Faculty Award (D18AP00050), NSF (IIS-1657197), ONR (N000141812132) and ARO (W911NF1810300). Chengguang Xu would like to thank Dat Huynh and Zwe Naing for their help and advice with some of the implementations during his research assistantship at MCADS lab, which resulted in this work.

References

[1] E. Elhamifar and Z. Naing, "Unsupervised procedure learning via joint dynamic summarization," International Conference on Computer Vision, 2019.
[2] E. Elhamifar, "Sequential facility location: Approximate submodularity and greedy algorithm," International Conference on Machine Learning, 2019.
[3] E. Elhamifar and M. C. De-Paolis-Kaluza, "Subset selection and summarization in sequential data," Neural Information Processing Systems, 2017.
[4] ——, "Online summarization via submodular and convex optimization," IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[6] A. Khosla, R. Hamid, C. J. Lin, and N. Sundaresan, "Large-scale video summarization using web-image priors," IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[7] A. Sharghi, B. Gong, and M. Shah, "Query-focused extractive video summarization," European Conference on Computer Vision, 2016.
[8] B. Gong, W. Chao, K. Grauman, and F. Sha, "Diverse sequential subset selection for supervised video summarization," Neural Information Processing Systems, 2014.
[9] I. Simon, N. Snavely, and S. M. Seitz, "Scene summarization for online image collections," IEEE International Conference on Computer Vision, 2007.
[10] H. Lin and J. Bilmes, "Learning mixtures of submodular shells with application to document summarization," Conference on Uncertainty in Artificial Intelligence, 2012.
[11] A. Kulesza and B. Taskar, "Determinantal point processes for machine learning," Foundations and Trends in Machine Learning, vol. 5, 2012.
[12] E. Elhamifar, G. Sapiro, and S. S. Sastry, "Dissimilarity-based sparse subset selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[13] G. Kim, E. Xing, L. Fei-Fei, and T. Kanade, "Distributed cosegmentation via submodular optimization on anisotropic diffusion," International Conference on Computer Vision, 2011.
[14] A. Shah and Z. Ghahramani, "Determinantal clustering process – a nonparametric bayesian approach to kernel based semi-supervised clustering," Conference on Uncertainty in Artificial Intelligence, 2013.
[15] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, 2007.
[16] E. Elhamifar, S. Burden, and S. S. Sastry, "Adaptive piecewise-affine inverse modeling of hybrid dynamical systems," World Congress of the International Federation of Automatic Control (IFAC), 2014.
[17] E. Elhamifar and S. S. Sastry, "Energy disaggregation via learning 'powerlets' and sparse coding," AAAI Conference on Artificial Intelligence, 2015.
[18] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, 2003.
[19] I. Misra, A. Shrivastava, and M. Hebert, "Data-driven exemplar model selection," Winter Conference on Applications of Computer Vision, 2014.
[20] J. Hartline, V. S. Mirrokni, and M. Sundararajan, "Optimal marketing strategies over social networks," World Wide Web Conference, 2008.
[21] D. McSherry, "Diversity-conscious retrieval," Advances in Case-Based Reasoning, 2002.
[22] A. Krause, H. B. McMahan, C. Guestrin, and A. Gupta, "Robust submodular observation selection," Journal of Machine Learning Research, vol. 9, 2008.
[23] S. Joshi and S. Boyd, "Sensor selection via convex optimization," IEEE Transactions on Signal Processing, vol. 57, 2009.
[24] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions," Mathematical Programming, vol. 14, 1978.
[25] F. Hadlock, "Finding a maximum cut of a planar graph in polynomial time," SIAM Journal on Computing, vol. 4, 1975.
[26] R. Motwani and P. Raghavan, "Randomized algorithms," Cambridge University Press, New York, 1995.
[27] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," SIGIR, 1998.
[28] E. Elhamifar, G. Sapiro, and R. Vidal, "See all by looking at a few: Sparse modeling for finding representative objects," IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[29] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, "A convex model for non-negative matrix factorization and dimensionality reduction on physical space," IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3239–3252, 2012.
[30] A. Sharghi, A. Borji, C. Li, T. Yang, and B. Gong, "Improving sequential determinantal point processes for supervised video summarization," European Conference on Computer Vision, 2018.
[31] A. Borodin and G. Olshanski, "Distributions on partitions, point processes, and the hypergeometric kernel," Communications in Mathematical Physics, vol. 211, 2000.
[32] A. Krause and D. Golovin, "Submodular function maximization," Cambridge University Press, 2014.
[33] E. Elhamifar, G. Sapiro, and R. Vidal, "Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery," Neural Information Processing Systems, 2012.
[34] R. Panda and A. K. Roy-Chowdhury, "Collaborative summarization of topic-related videos," IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[35] P. Awasthi, A. S. Bandeira, M. Charikar, R. Krishnaswamy, S. Villar, and R. Ward, "Relax, no need to round: Integrality of clustering formulations," Conference on Innovations in Theoretical Computer Science (ITCS), 2015.
[36] A. Nellore and R. Ward, "Recovery guarantees for exemplar-based clustering," Information and Computation, 2015.
[37] H. J. Zhang, J. Wu, D. Zhong, and S. W. Smoliar, "An integrated system for content-based video retrieval and browsing," Pattern Recognition, vol. 30, no. 4, 1997.
[38] T. Liu and J. R. Kender, "Optimization algorithms for the selection of key frame sequences of variable length," European Conference on Computer Vision, 2002.
[39] Y. Li and B. Merialdo, "Multi-video summarization based on video-MMR," WIAMIS Workshop, 2010.
[40] S. E. F. de Avila, A. P. B. Lopes, A. da Luz, and A. de Albuquerque Araujo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognition Letters, vol. 32, 2011.
[41] B. Zhao and E. P. Xing, "Quasi real-time summarization for consumer videos," IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[42] R. Hong, J. Tang, H. K. Tan, S. Yan, C. Ngo, and T. Chua, "Event driven summarization for web videos," SIGMM Workshop, 2009.
[43] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[44] C. W. Ngo, Y. F. Ma, and H. Zhang, "Automatic video summarization by graph modeling," International Conference on Computer Vision, 2013.
[45] G. Kim and E. P. Xing, "Reconstructing storyline graphs for image recommendation from web community photos," IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[46] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," CVPR, 2015.
[47] H. W. Kang, Y. Matsushita, X. Tang, and X. Q. Chen, "Space-time video montage," IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[48] Y. F. Ma, L. Lu, H. J. Zhang, and M. Li, "A user attention model for video summarization," ACM Multimedia, 2002.
[49] W. S. Chu, Y. Song, and A. Jaimes, "Video co-summarization: Video summarization by visual co-occurrence," IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[50] S. Tschiatschek, R. Iyer, H. Wei, and J. Bilmes, "Learning mixtures of submodular functions for image collection summarization," Neural Information Processing Systems, 2014.
[51] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar, "Expectation-maximization for learning determinantal point processes," Neural Information Processing Systems, 2014.
[52] M. Gygli, H. Grabner, and L. V. Gool, "Video summarization by learning submodular mixtures of objectives," IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[53] W. L. Chao, B. Gong, K. Grauman, and F. Sha, "Large-margin determinantal point processes," Uncertainty in Artificial Intelligence, 2015.
[54] K. Zhang, W. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," European Conference on Computer Vision, 2016.
[55] K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Summary transfer: Exemplar-based subset selection for video summarization," IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[56] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, "Category-specific video summarization," European Conference on Computer Vision, 2014.
[57] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys, "A constant-factor approximation algorithm for the k-median problem," Journal of Computer and System Sciences, vol. 65, 2002.
[58] H. Kuehne, A. Arslan, and T. Serre, "The language of actions: Recovering the syntax and semantics of goal-directed human activities," IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[59] A. Guillory and J. Bilmes, "Interactive submodular set cover," International Conference on Machine Learning, 2010.
[60] A. G. del Molino, X. Boix, J. H. Lim, and A. H. Tan, "Active video summarization: Customized summaries via on-line interaction with the user," AAAI, 2017.
[61] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, "Weakly supervised summarization of web videos," International Conference on Computer Vision, 2017.
[62] S. Cai, W. Zuo, L. S. Davis, and L. Zhang, "Weakly-supervised video summarization using variational encoder-decoder and web prior," European Conference on Computer Vision, 2018.
[63] H. O. Song, S. Jegelka, V. Rathod, and K. Murphy, "Deep metric learning via facility location," IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[64] N. Lazic, B. J. Frey, and P. Aarabi, "Solving the uncapacitated facility location problem using message passing algorithms," International Conference on Artificial Intelligence and Statistics, 2007.
[65] S. Li, "A 1.488 approximation algorithm for the uncapacitated facility location problem," Automata, Languages and Programming, 2011.
[66] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz, "A tight linear time (1/2)-approximation for unconstrained submodular maximization," Annual Symposium on Foundations of Computer Science, 2012.
[67] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," International Conference on Computer Vision, 2015.
[68] M. Rochan, L. Ye, and Y. Wang, "Video summarization using fully convolutional sequence networks," European Conference on Computer Vision, 2018.