Paper ID: 661

Title: Deep Supervised Summarization: Algorithm and Application to Learning Instructions

The paper is well-written and easy to follow. The main contributions are clear and justified, and they follow quite directly from the idea of using facility location and clustering to do learning. The technical ideas are strongly related to clustering, from the clustering-based loss functions to the EM/K-means flavor of the main algorithm. The experimental results quite thoroughly demonstrate the value of the proposed method on a single dataset, though additional datasets would strengthen the evaluation. [Update after Author Feedback] I thank the authors for their detailed responses to the reviewer questions.

This paper proposes a sparse convex relaxation of the facility location utility function for subset selection, applied to the problem of recovering ground-truth representatives for datasets. This relaxation is used to develop a supervised learning approach, which involves a learning algorithm that alternately updates three loss functions (Eq. 7 and Alg. 1) based on three conditions under which this relaxation recovers ground-truth representatives (Theorem 1). The supervised facility location learning approach described in this paper appears to be novel, and is described clearly. The experimental results are reasonably convincing overall. One weakness is that only one dataset is used, the Breakfast dataset. It would be more convincing to include results for at least one other dataset. The epoch 0 (before training) TSNE visualization is unnecessary and can be removed. Also, the interpretation of Fig. 3 provided in the paper is somewhat subjective and not completely convincing. For example, while the authors point out that the SubModMix and dppLSTM baselines can select multiple representatives from the same activity, SupUFL-L (one of the proposed approaches in the paper) can also select multiple representatives from the same activity. Also, while the baselines fail to select representatives from the “take cup” subactivity, SupUFL-L fails to select representatives from the SIL subactivity. Finally, error bars (confidence estimates) should be provided for the scores in Tables 1 and 2.
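For context, the unsupervised facility location objective underlying the paper can be sketched as follows. This is a generic greedy routine over raw features (the function names and the greedy solver are illustrative only; the paper instead optimizes a sparse convex relaxation over learned embeddings f_\Theta):

```python
import numpy as np

def facility_location_cost(X, reps):
    """Encoding cost: each point pays its distance to the nearest representative."""
    # X: (n, d) data matrix; reps: non-empty list of row indices chosen as representatives
    d = np.linalg.norm(X[:, None, :] - X[reps][None, :, :], axis=-1)  # (n, |reps|)
    return d.min(axis=1).sum()

def greedy_facility_location(X, k):
    """Greedily pick k representatives that minimize the facility-location cost.

    Illustrative baseline only: the paper's method replaces this combinatorial
    search with a sparse convex relaxation and learns the embedding f_Theta
    jointly, alternating between its three loss terms (Eq. 7, Alg. 1).
    """
    reps = []
    for _ in range(k):
        cand = [j for j in range(len(X)) if j not in reps]
        costs = [facility_location_cost(X, reps + [j]) for j in cand]
        reps.append(cand[int(np.argmin(costs))])
    return reps
```

On well-separated clusters, this greedy selection recovers one representative per cluster, which is the behavior the supervised approach aims to guarantee (under the conditions of Theorem 1) in the learned embedding space.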

Minor comments:

- Probably a naive comment: in (7), there may be some trivial \Theta (say, \Theta = 0 in some settings) that forces all f_\Theta(y) to be equal, so that all data points are assigned to a single cluster. In practice, a random initialization of \Theta and an iterative gradient algorithm may result in a reasonably good \Theta, but then the problem actually being solved is not the one in (7)?
- In the experiments, is it possible to include a tweaked version of [59] as another baseline?
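The degenerate-\Theta concern can be made concrete with a toy linear encoder (the linear form f_\Theta(y) = \Theta y and all names here are illustrative, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))  # 20 points with genuinely distinct features

def pairwise_encoding_costs(Theta, Y):
    """Distance of every point to every candidate representative in embedding space."""
    Z = Y @ Theta.T
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

# The degenerate solution the comment worries about: Theta = 0 maps every
# point to the origin, so all pairwise encoding costs vanish and any single
# point is a "perfect" representative -- the data collapses into one cluster.
Theta_zero = np.zeros((3, 5))
D = pairwise_encoding_costs(Theta_zero, Y)
print(np.allclose(D, 0.0))  # True
```

This shows why the objective in (7) admits trivial minimizers unless something (initialization, regularization, or the supervised loss terms) rules them out.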