Paper ID: 159
Title: Combining Low-Density Separators with CNNs
This paper proposes a method for improving the performance of CNNs as feature extractors through an unsupervised training step. To make the representations learned by CNNs more general, the authors propose low-density separator modules, which aim to find splits in the visual feature space that keep high-density regions intact while cutting through low-density regions. This is accomplished by relying on quasi-class labels and jointly optimizing both the generated quasi-classes and the splits between them. The splits, or decision boundaries, are the weights of the k-th network layer, with the 1st through (k-1)-th layers serving as the feature space. The authors demonstrate good performance on four datasets.
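As a reading aid, here is a minimal sketch of the setup described above, i.e., treating the weights of a k-th layer as the separating hyperplanes while layers 1 through k-1 act as a frozen feature space. Everything in it (the linear stand-in for the backbone, the cross-entropy surrogate, and all names and sizes) is an illustrative assumption, not the authors' actual module or objective:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: `backbone` plays the role of layers 1..(k-1) of an
# already supervised-trained CNN and is kept frozen; the LDS module plays the
# role of the k-th layer, whose weight rows are the candidate separators.
feat_dim, n_quasi_classes = 4096, 100
backbone = nn.Sequential(nn.Linear(3 * 224 * 224, feat_dim), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False

lds = nn.Linear(feat_dim, n_quasi_classes)        # separators = rows of lds.weight
opt = torch.optim.SGD(lds.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def lds_step(images, quasi_labels):
    """One update of the separator weights against the current quasi-class labels."""
    with torch.no_grad():
        feats = backbone(images.flatten(1))        # features from layers 1..(k-1)
    loss = loss_fn(lds(feats), quasi_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```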
Technical quality: The paper could benefit from a little more discussion of the methods.
- For example, why is the joint quasi-class generation and split discovery beneficial, as opposed to fixing the quasi-classes just once based on some external information?
- Could the unsupervised training stage be done before the supervised one?
- Also, what is the rationale behind the choice of datasets on which to test the method?
- What is the evidence for the statement on L133?
- How about datasets like ImageNet or MSCOCO?
- Why was video data used for the unsupervised pre-training step?

Novelty: Low-density separators have been proposed before, but to my knowledge not in the context of CNNs.

Impact: It is very appealing to propose an unsupervised improvement of the feature representations learned by CNNs since, as the authors argue, in many real-world applications we do not have the luxury of simply obtaining lots of labeled data. If code is provided, this work will be useful to many.

Clarity: Could be improved a bit. The notation is somewhat cluttered and hard to read, and the third and sixth terms in Eq. 3 could be explained in more detail.

Some more comments:
- I am not sure Figure 1 helps one's understanding of the idea.
- L152: Compared to what?
- L164: What if a different non-linearity is used?
- Figure 4 would be much more readable if the bars were color-coded.
2-Confident (read it all; understood it all reasonably well)
The paper presents a method for unsupervised pre-training of convolutional neural networks, which is one of the hot topics nowadays. The idea of the method is to learn so-called low-density separators, which can be thought of as separating hyperplanes in areas of low data density. The criterion is reasonable and closely related to clustering as a pre-training method. Learning low-density separators is formulated as learning hyperplanes between pseudo-classes. The pseudo-classes are found jointly with the hyperplanes in a combined optimization problem. The goal of the work is to increase the generalization capability of CNNs, especially in cases with only a few annotated examples.
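To illustrate what "found jointly with the hyperplanes" could look like in practice, a minimal alternating-optimization sketch follows; the specific assignment and refitting rules are assumptions made for illustration only and are not taken from the paper:

```python
import numpy as np

def joint_lds(features, n_quasi, n_iters=10, seed=0):
    """Illustrative alternating scheme for the combined optimization mentioned
    above: update pseudo-class memberships given the hyperplanes, then refit
    the hyperplanes given the memberships. A generic surrogate (nearest-
    separator assignment plus one-vs-rest least squares), not the paper's
    actual objective."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    labels = rng.integers(0, n_quasi, size=n)        # random initial memberships
    W = 0.01 * rng.standard_normal((n_quasi, d))     # one hyperplane per pseudo-class
    for _ in range(n_iters):
        # (1) membership step: assign each sample to the pseudo-class whose
        #     separator scores it highest.
        labels = (features @ W.T).argmax(axis=1)
        # (2) separator step: refit the hyperplanes against the one-vs-rest
        #     targets implied by the current memberships.
        Y = -np.ones((n, n_quasi))
        Y[np.arange(n), labels] = 1.0
        W = np.linalg.lstsq(features, Y, rcond=None)[0].T
    return W, labels
```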
Pros:
(1) The paper is well written. However, the main difficulties in terms of readability are the numerous references to the supplementary material and the missing clear overview of the approach itself.
(2) I like the approach, since it is novel, innovative, and well reported.
(3) The references are reasonable.

Cons:
(1) Minor: What kind of distance metric is used in Eq. 2? This should be documented.
(2) Minor: The parameters are documented but not well motivated (I know that this is tricky).
(3) Minor: The related work in L56 to L88 should be highlighted as such.
(4) The approach is rather an unsupervised fine-tuning method than a pre-training method. The authors also only try the method on later layers. It would be interesting to clearly see the limits of the method.
(5) The paper would significantly benefit from a clear overview figure showing the pipeline of the approach. It is quite hard to understand it by reading the paper, since the term "unsupervised pre-training" is used in the beginning, but later on an SVM is used for classification.
(6) Is it possible to train the resulting architecture in an end-to-end manner?
(7) Lines 124/125 mention that it is a clear benefit of the approach that the membership of each data point does not have to be estimated. However, this is not true, since the membership matrix T is also estimated in the proposed method.
(8) There should be a baseline where the clustering is not done jointly with learning the hyperplanes.
(9) What are the computational challenges of the approach? The only statement on this, in L218, suggests that the method might indeed be computationally demanding.
(10) The statement in L319 should be proven.
2-Confident (read it all; understood it all reasonably well)
The paper describes a method for improving the generalizability of features from pre-trained convolutional neural networks. It introduces a new low-density separator layer that is trained in an unsupervised manner to discriminate between quasi-classes on a very large dataset of unlabeled images. The quasi-classes are selected from the data by first choosing a small number of examples using the k-means++ heuristic and then expanding these iteratively using linear support vector machines. Given the quasi-classes, which are assumed to correspond to high-density regions in the feature space, low-density separators are then fit by optimizing an objective designed to produce separators between regions that yields small intra-class distances and large inter-class distances. The experiments show that the resulting features give impressive results on several datasets when there are few labeled training examples, outperforming features from a vanilla pre-trained CNN and another state-of-the-art method (DAG-CNN).
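The seeding-and-expansion procedure described in this summary can be made concrete with a short sketch; note that the growth rule, the parameter values, and the use of scikit-learn's kmeans_plusplus and LinearSVC are assumptions for illustration only, not the paper's procedure:

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus
from sklearn.svm import LinearSVC

def grow_quasi_classes(features, n_quasi=50, seed_size=5, rounds=3, grow_by=20):
    """Illustrative quasi-class generation in the spirit described above:
    seed each quasi-class near a k-means++ center, then repeatedly train a
    one-vs-rest linear SVM and absorb the points it scores highest."""
    centers, _ = kmeans_plusplus(features, n_clusters=n_quasi, random_state=0)
    # seed each quasi-class with the points closest to its center
    members = [set(np.argsort(np.linalg.norm(features - c, axis=1))[:seed_size].tolist())
               for c in centers]
    for _ in range(rounds):
        for k in range(n_quasi):
            pos = np.fromiter(members[k], dtype=int)
            labels = np.zeros(len(features)); labels[pos] = 1
            svm = LinearSVC(C=1.0).fit(features, labels)
            scores = svm.decision_function(features)
            scores[pos] = -np.inf                     # do not re-add current members
            members[k] |= set(np.argsort(scores)[-grow_by:].tolist())
    return members
```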
The problem of exploiting large datasets of unlabeled images to improve representations is widely acknowledged as extremely important, and this work makes an important step in this direction. The approach in the paper still requires a CNN that has been pre-trained on a large set of labeled images, but shows that more generalizable representations can be extracted from such a network using a post-hoc unsupervised training process on a larger set of unlabeled images. The idea of automatically identifying and using pseudo-classes for unsupervised training on a very large dataset comes from the authors' previous ECCV submission [17], given in the supplementary material, with the main delta being that the method presented here is more general and allows fine-tuning the full network with backprop when sufficient examples are available. Personally, I think the delta is sufficient, and the results on standard datasets when there are few labeled examples clearly demonstrate the effectiveness of the approach. In general, I think that the problem being addressed is important, the idea is novel, the paper is well written and clear, and the results are impressive. The supplementary material is a valuable addition.

Issues:
- It was not clear to me from the paper how exactly recursive feature elimination was done: are the most active features selected using the YFCC100M dataset or the target dataset? Also, I would expect PCA to work better than random projections: was this attempted too?
- The paper uses the term "unsupervised pre-training" to refer to the process of identifying the low-density separators. I found this a little confusing, as the term is often used for methods like stacked autoencoders and RBMs that initialize weights when training deep nets. Here, the pre-training actually happens after the network has undergone supervised training on a labeled dataset. Furthermore, the "pre-training" step is not always followed by fine-tuning (i.e., when only a very small number of examples are available and the features are used off the shelf), which makes it even more misleading.
- Including some information (perhaps in the supplementary material) on how long the training took would be useful.
- In Figure 4 you give the classification accuracies for a single-scale CNN. Since the single-scale LDS+CNN replaces the last CNN layer, it would be good to see the performance of the single-scale CNN for features from both the last and penultimate layers in these figures (i.e., fc6 and fc7).

Corrections:
- Line 100: "An unit" -> "A unit"
- Line 207: "SVM's" -> "SVMs"
- Line 286: "significant" -> "significantly"
- Figure 3: "categary" -> "category"
2-Confident (read it all; understood it all reasonably well)
In this paper, the authors proposed to use a large corpus of unlabeled data to guide the learning of a convolutional neural network. Unlike conventional supervised learning of a CNN, this study trained the CNN in an unsupervised manner. The authors discovered a number of latent high-density quasi-classes from the large number of unlabeled images in an unsupervised manner. This procedure can be regarded as a kind of unsupervised image grouping, and the high-density quasi-classes can be regarded as potential “latent categories” or “latent attributes” that commonly appear among the images. Then, the authors used these quasi-classes (“latent categories” or “latent attributes”) as ground-truth labels of the images to train the CNN (i.e., to learn the low-density separator, LDS). In this way, compared to the old CNN, which represents patterns specialized to human supervision, the new CNN can be regarded as a generative model describing hidden structures of the unlabeled images. Thus, the new CNN is much easier to fine-tune with only a few object samples.
If we do not consider the great overlap with [17], this paper presents a good study for enhancing the transferability of the CNN, and the experimental results exhibit superior performance of the proposed method. I still have two questions about the technique.

First, can the authors compare the proposed unsupervised initialization of the CNN with the widely used “msra” initialization method [cite]?

Second, it is quite odd to average the H_1 x H_2 x F conv-layer features along the first and second dimensions into a 1 x 1 x F feature vector so as to make the conv-layer features applicable to the proposed technique. This is because brute-force averaging may lose significant spatial information about local patterns in the images (a small sketch of this pooling is given after the references below).

Finally, to clarify the technical contributions, the authors should discuss the differences with the related work of [cite2, cite3, cite4]. The idea of [cite2] is also similar to the proposed method: [cite2] extracted attributes of images, which took the role of the “quasi-classes” in the proposed method, and then used the attributes as category labels to train the CNN. Thus, the overall idea of [cite2] is very similar to the proposed method. [cite3] used a large number of unlabeled images to learn discriminative sub-space CNN features that are invariant w.r.t. different domains, in order to enhance the transferability of the CNN. [cite4] used cross-convolutional-layer pooling to mine subarrays of convolutional feature maps as features to improve the feature representation.

[cite] He et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, in CVPR 2015.
[cite2] Ba et al., Predicting Deep Zero-Shot Convolutional Neural Networks using Textual Descriptions, in ICCV 2015.
[cite3] Ganin et al., Unsupervised Domain Adaptation by Backpropagation.
[cite4] Liu et al., The Treasure beneath Convolutional Layers: Cross-convolutional-layer Pooling for Image Classification, in CVPR 2015.
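For reference, the averaging questioned in the second point above amounts to global average pooling over the two spatial dimensions; a tiny sketch follows, with the map size chosen only as an assumed example:

```python
import numpy as np

# Minimal illustration of the pooling in question: an H_1 x H_2 x F conv-layer
# response is averaged over its two spatial dimensions, leaving a single
# F-dimensional descriptor (global average pooling), which indeed discards
# where in the map each local pattern fired.
conv_map = np.random.rand(13, 13, 256)      # H_1 x H_2 x F, e.g. a conv5-like response
pooled = conv_map.mean(axis=(0, 1))         # shape (256,), i.e. 1 x 1 x F
assert pooled.shape == (256,)
```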
1-Less confident (might not have understood significant parts)
The paper proposes to use low-density separators to utilize a large amount of unlabeled data after training image classification networks with label supervision. The high-level features learned from the labeled data may not generalize well to other domains directly, while it may be hard to fine-tune the model directly due to a lack of data. The proposed additional pre-training stage may make it possible to learn a richer representation from a much larger amount of unlabeled data.
Overall, the paper presents an encouraging method for making use of a large amount of unlabeled data to learn a richer representation. In the evaluation, the proposed method is tested on multiple datasets and shows better performance. It may provide more insight into unsupervised learning and serve as a good reference for future work. I have two concerns about the paper:
- There is no result on stacking LDS layers. Intuitively, LDS may not be a good heuristic for layers that are not directly connected to the output layer. This may prohibit using LDS to learn richer representations across multiple layers.
- What is the training time for LDS?
Also, in Figure 4, it would be better to use tables instead of truncated bars to show the results, because that makes it easier to understand the performance differences and to compare with future works.
2-Confident (read it all; understood it all reasonably well)
Fine-tuning the higher layers of pre-trained CNNs is a commonly used technique. However, if the amount of labeled training data is small, fine-tuning cannot effectively adjust the parameters in the higher layers of the convolutional neural network, which reduces performance. In this paper, the authors use a large set of unlabeled images to pre-train the higher layers before fine-tuning them with a small amount of labeled data. The Low-Density Separator (LDS) is proposed for this purpose.
The article was written reasonably well. There are no major flaws. However, the article was difficult to read.
2-Confident (read it all; understood it all reasonably well)