
Submitted by
Assigned_Reviewer_2
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors build on the replicated softmax model to
devise a classification approach that lends itself to video
classification with only a few examples. The model consists of two parts.
The first part is similar to a replicated softmax model that captures
topics in the features, with the difference that it uses rectified linear
units. The second part consists of a discriminant function that is a linear
combination of topics. The authors apply a trick to make inference
tractable: they derive a variational bound, which is further lower bounded
by exploiting the semi-conjugacy between the rectified linear units and the
Gaussian likelihood. The model is evaluated on a dataset of videos of
social activities, which are to be classified.
The problem is
important, the model derivation is sound, and the experimental evaluation
is correct.
However, the approach is independent of (and not motivated by)
the application of unstructured social group activity recognition. It
seems to me that the authors developed a general feature learning and
classification approach that should be evaluated on other
classification datasets and against further baselines.
Q2: Please summarize your review in 1-2
sentences
An OK approach, but it does not have much to do
with social group activity recognition. As a generic feature learning and
classification approach, more comparisons to other baselines would
strengthen the work.
Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
This paper considers automatic classification of
unstructured social group activity videos. To bridge the semantic gap
between low-level features and the class labels, the authors adopt a
latent topic model based on replicated softmax to extract topics as
mid-level representations for video classification. The main idea of this
paper is the integration of sparse Bayesian learning and replicated
softmax, which leads to the proposed model, referred to as the “relevance
topic model (RTM)”. In RTM, the discriminative topics and sparse classifier
weights are learned jointly, and the authors propose a variational EM
algorithm for model parameter estimation and inference. The authors test
their algorithm on a benchmark dataset and demonstrate better performance
compared to other supervised topic models and some baseline algorithms.
The paper seems to be well organized. It is well motivated and
proposes ideas that are useful in related areas (e.g. video scene
analysis, classification and recognition). It cites relevant research
papers adequately. It is difficult to understand the significance of
using sparsity on the BOW representation, the classifier and the
hierarchical prior on the weights. Also, it is not clear what the significant
differences are between the attributes used in the previous research [4]
and the topics discovered by the proposed RTM. It should be noted that the
idea of using the replicated softmax model for extracting latent topics as
mid-level representations is not novel [5]. However, the key idea of this
paper is the joint learning of latent topics and classifier weights, which
is interesting and novel. Section 3.3 contains many complex
mathematical equations, which are not easy to follow. Experimental
results demonstrate good performance of the proposed algorithm. However, the
experimental support is rather weak due to the following two issues: 1) The
performance in terms of generalization is not good. When many instances
are used (e.g., 100 in Table 2), the proposed algorithm does not achieve
good results. 2) In Section 4.3, the authors compare the proposed
algorithm with only the baselines in [4]. Since the novelty of the
proposed algorithm in relation to the previous replicated softmax model
seems to be the joint learning of discriminative topics and sparse
classifier weights, the reviewer suggests using other baselines where the
topics are extracted via the previous replicated softmax, but the classifier
weights are learned separately from the topics (e.g. latent topic
extraction using replicated softmax [5] + SVM classifier).
Q2: Please summarize your review in 1-2
sentences
The key idea of this paper is the joint learning of
latent topics and classifier weights, which is interesting and will be
useful in many related research fields (e.g. video context classification
and recognition). In the experiment section, it is recommended to include
additional baseline methods for comparison, to reveal the strength of the
above key idea.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
General: The paper proposes a supervised (hybrid)
topic model for unstructured activity recognition. The model is named the
relevance topic model (RTM). The supervision comes from the class labels
of the training videos.
RTM is an integration of sparse
Bayesian learning and Replicated Softmax. The main concept is to jointly
learn discriminative topics as mid-level video representations and a
discriminant function as a video classifier. RTM is composed of an
undirected part to model the marginal distribution of video words and
a directed part to model the conditional distribution of video classes
given the latent topics. The authors also propose parameter
estimation and inference methods. The authors evaluate their algorithm
on the Unstructured Social Activity Attribute (USAA) dataset, presenting
quantitative results for RTM. They compare the method with the literature,
improving the activity classification accuracy in all cases. The
authors also show an interesting comparison of the correlation of topics of
two different classes.
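As a reading aid, the two-part structure summarized above can be written schematically as follows. This is only a sketch under stated assumptions: the symbols v (video word counts), h (latent topics), y (class label), E, \eta_c and \alpha_{cj} are illustrative names that are not taken from the paper, and the Gaussian prior shown is the standard sparse Bayesian learning (relevance vector machine) form, not a confirmed detail of RTM.

```latex
% Schematic only; every symbol here is assumed for illustration.
\begin{align*}
  % Undirected part: joint over video words v and latent topics h
  p(v, h) &= \frac{1}{Z} \exp\bigl(-E(v, h)\bigr) \\
  % Directed part: a linear discriminant function per class c
  f_c(h) &= \eta_c^{\top} h \\
  % Hybrid joint model: the label depends on the words only through h
  p(v, h, y) &= p(v, h)\, p(y \mid h) \\
  % Standard sparse Bayesian prior on the classifier weights
  p(\eta_c \mid \alpha_c) &= \prod_{j} \mathcal{N}\bigl(\eta_{cj} \mid 0,\ \alpha_{cj}^{-1}\bigr)
\end{align*}
```

Under a prior of this kind, a weight whose precision \alpha_{cj} grows large is driven toward zero, which is the usual mechanism by which sparse Bayesian learning prunes irrelevant components.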
The idea seems interesting. The model uses SIFT,
STIP and MFCC, which are very low-level features, mostly based on pixel
representations. A short discussion of how more
descriptive features (e.g. object detection) could be added to the model
would be interesting.
Relevant work that could be added to the Bibliography: [1] similar
subject, [2] interesting approach to structured activity discovery.
[1] Social Role Discovery in Human Events. Vignesh Ramanathan,
Bangpeng Yao, and Li Fei-Fei. IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). Portland, OR, USA. June 23-28, 2013.
[2] J. Varadarajan, R. Emonet and J.-M. Odobez. Int. Journal of Computer
Vision (IJCV), Vol. 103, Num. 1, pages 100-126, May 2013.
Quality:
The quality of the paper is good.
Clarity: The paper is
clear, the problem is well defined.
Originality: The
originality is medium. This is not the first approach that aims at
building intermediate features for activity recognition, nor the first one
that uses full or semi-supervision.
Unsupervised methods such
as [2] for structured scenarios make strong assumptions about the temporal
structure of the words (observation vectors), but in unstructured
activities such as the ones targeted in this work, temporal constraints can
be relaxed.
Minimal supervision in the learning is acceptable for a
task as difficult as activity recognition in wild videos. An interesting
question would be what minimum amount of human-labeled data is
required to classify wild videos decently.
Significance: The
topic is an exciting research area, I would say of big significance.
Q2: Please summarize your review in 1-2
sentences
The paper addresses an interesting topic, which is the
recognition of social activities in unconstrained videos. The authors
propose an interesting model, which uses minimal labeling for learning and
discovering topics used for activity recognition. It is well written, with
enough evaluation.
Q1: Author rebuttal:
Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank all the reviewers for their comments and
suggestions. In particular, they consider that we proposed a novel and
effective model (the relevance topic model) to address the challenging
problem of unstructured social group activity recognition by jointly learning
discriminative latent topics and a classifier with sparse weights, with a
small number of labeled training instances. We appreciate their positive
assessments of our work, e.g., "idea is interesting and novel"
(R4 & R5), "derivation is sound" (R2), "evaluation is correct" (R2),
"enough" (R5), "good performance" (R4), "paper is clear, and well organized
and written" (R4 & R5), and "of big significance" (R5).
The
following responds to each of the main concerns raised by the reviewers.
To R2:
1. The approach has not much to do with social
group activity recognition.
>> The proposed RTM is strongly
motivated by two main challenges in unstructured social group
activity recognition: 1) the semantic gap between low-level visual
features and class labels and 2) the lack of labeled training data.
>> To address the former, RTM
jointly learns latent relevance topics as mid-level representations and a
classifier mapping the mid-level representations to class labels. To
address the latter, RTM enforces sparsity on the relevance topics through
sparse Bayesian learning to prevent overfitting to specific training
instances, which is clearly validated in Section 4.3.
2. As a generic feature learning
and classification approach, other baselines and experiments on other
datasets would be necessary.
>> Our model mainly focuses on
the task of unstructured social group activity recognition. We will try to
generalize our model to other potential applications in future work.
To R4:
1. Significance of using sparsity.
>>
Sparsity makes the relevance topics learned by RTM more
interpretable and discriminative. It provides good generalization and
robustness when only a small number of labeled training instances is
available, which is experimentally justified in Section 4.3.
2. Difference
between the attributes used in [4] and the topics discovered by RTM.
>> [4] models attributes as topics by LDA. It learns a
SEMI-latent topic space because some of the topics are USER-DEFINED. The
proposed RTM learns a COMPLETELY latent topic space, which can be easily
generalized to any larger or new datasets.
3. In the case of using
many instances, the performance is not good.
>> When using
many (100) instances, our method is only slightly worse (1.28%) than [4],
which is state-of-the-art, but better (4.99%) in the case of 10 instances.
In addition to accuracy, our method learns a lower-dimensional latent
semantic space, which provides much more efficient representations.
4. Compare with additional baseline methods where the topics and
the classifier are learned separately, e.g., replicated softmax + SVM.
>> Thanks for your suggestion; we may include such
experimental results. However, it should be noted that learning topics and
a classifier separately is generally worse than learning them jointly. A
theoretical insight is that topic extraction is unsupervised in separate
learning but supervised in joint learning, so the topics extracted in the
latter are more predictive than those in the former. Besides, a similar
case, i.e., LDA + SVM, is worse than joint learning, which has been
demonstrated in [11].
To R5:
1. An interesting question
would be how many is the minimal human labeled data that is required.
>> In our current experiment, 10 would be the minimal
acceptable number for the dataset used. We will study this problem in
depth in future work.
2. Relevant work that could be added
in the Bibliography, [1] Similar subject [2] interesting approach on
structured activity discovery.
>> Thanks for pointing out these
two fairly new papers (published after the submission of this paper). We
will cite them.
We thank all reviewers again. We would also like to
address several minor concerns, e.g., clarifying the complex mathematical
equations, if the paper is accepted.
 