---
layout: default
title: CompReSS
---

Soroush Abbasi Koohpayegani*, Ajinkya Tejankar*, Vipin Pillai, Hamed Pirsiavash

University of Maryland, Baltimore County

\* denotes equal contribution


Abstract

Self-supervised learning aims to learn good representations with unlabeled data. Recent works have shown that larger models benefit more from self-supervised learning than smaller models. As a result, the gap between supervised and self-supervised learning has been greatly reduced for larger models. In this work, we focus on self-supervised learning for low-capacity models, which has various applications (e.g., edge computation). We compress a deep teacher model so that the student mimics the relative distances between the datapoints in the teacher's embedding space. For ResNet-50, our method outperforms SOTA self-supervised models marginally on ImageNet linear evaluation and by a large margin on nearest neighbor evaluation (by 6 points). For AlexNet, our method outperforms all previous methods, including the fully supervised model, on ImageNet linear evaluation (57.6% compared to 56.5%) and by a large margin on nearest neighbor evaluation (52.3% compared to 41.4%). This is the first time a self-supervised AlexNet has outperformed the supervised one on ImageNet classification.
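The core idea, having the student mimic the relative distances between datapoints in the teacher's embedding space, can be sketched as a similarity-distillation loss. The sketch below is a minimal illustration, not the paper's exact implementation: the anchor set, temperature `tau`, and KL-based matching are assumptions for exposition.

```python
import numpy as np

def _normalize(x):
    """Row-wise L2 normalization so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def _softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def compress_loss(teacher_emb, student_emb, teacher_anchors, student_anchors, tau=0.04):
    """Similarity-distillation sketch: each point's similarity distribution over a
    shared set of anchor points, computed in the teacher's space, is the target
    distribution the student must match (KL divergence, averaged over the batch).
    The anchor mechanism and temperature here are illustrative assumptions."""
    p_t = _softmax(_normalize(teacher_emb) @ _normalize(teacher_anchors).T / tau)
    p_s = _softmax(_normalize(student_emb) @ _normalize(student_anchors).T / tau)
    # KL(teacher || student); a small epsilon guards the logs
    return float(np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1)))
```

Note that the student's embedding dimension need not match the teacher's: only the similarity distributions over anchors are compared, which is what lets a small student mimic a much larger teacher.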

Paper

CompReSS: Compressing Representations for Self-Supervised Learning

https://github.com/UMBCvision/CompReSS

Contributions

Using our framework (CompReSS), we train a model on unlabeled data that is better than the same model trained from scratch with self-supervised learning. Our models reduce the gap between SOTA self-supervised and supervised models, and even outperform the supervised model for the AlexNet architecture.

Results

We train the teacher model on ImageNet without labels, compress it to the student model, and evaluate on the ImageNet validation set. Our model consistently outperforms all compression methods for various teacher-student combinations. On NN evaluation, our ResNet-50 is only 1 point worse than its ResNet50x4 teacher. When we compress ResNet50x4 to AlexNet, we get 57.6% on linear and 52.3% on NN evaluation, which outperforms the supervised model by almost 1 point on linear and 9 points on NN evaluation. For cluster alignment, we cluster our features on ImageNet using k-means, map each cluster to a unique ImageNet category, and evaluate Top-1 accuracy on the ImageNet validation set. Our method outperforms all baselines, including the supervised one, by a large margin (33.3% compared to 22.9%). Note that any model below the teacher row uses the student architecture.
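The nearest neighbor (NN) evaluation used above can be sketched as a 1-NN classifier under cosine similarity: each validation feature takes the label of its closest training feature. This is a minimal sketch; the exact protocol details (e.g., feature extraction layer) are assumptions.

```python
import numpy as np

def nn_accuracy(train_feats, train_labels, val_feats, val_labels):
    """1-nearest-neighbor evaluation sketch: classify each validation feature
    with the label of its most similar (cosine) training feature."""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    va = val_feats / np.linalg.norm(val_feats, axis=1, keepdims=True)
    nearest = np.argmax(va @ tr.T, axis=1)  # index of the closest training point
    return float(np.mean(train_labels[nearest] == val_labels))
```

Because no classifier is trained, NN evaluation directly probes how well the embedding space groups semantically similar images, which is why it is a stricter test than linear evaluation.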

Comparison with SOTA self-supervised methods: our method outperforms all baselines, including the model trained with ImageNet labels, for AlexNet. We report results from the best layer, written in parentheses: 'f7' refers to the 'fc7' layer and 'c4' refers to the 'conv4' layer. * refers to 10-crop evaluation.

We evaluate AlexNet trained with a ResNet50x4 teacher on the PASCAL-VOC classification and detection tasks. For the classification task, we train only a linear classifier on top of the frozen backbone, in contrast to the baselines, which finetune all layers. Interestingly, even in this more difficult setting, our method outperforms all self-supervised benchmarks, including MoCo, and is on par with ImageNet-supervised features. For object detection, we use Fast-RCNN and finetune all layers.
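The frozen-backbone linear evaluation above amounts to training a single softmax layer on fixed features. A minimal sketch with plain gradient descent follows; the learning rate, epoch count, and lack of regularization are illustrative assumptions, not the actual training recipe.

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.1, epochs=200):
    """Linear-evaluation sketch: backbone features are frozen, so only a
    single softmax layer (W, b) is trained with gradient descent.
    Hyperparameters are illustrative assumptions."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n                       # softmax cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, feats, labels):
    return float(np.mean(np.argmax(feats @ W + b, axis=1) == labels))
```

Keeping the backbone frozen means the score reflects the quality of the learned representation itself, rather than how well the network can be finetuned to the downstream task.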

Cluster Alignment Result

Our method is not designed specifically to learn a good clustering. However, since it achieves good nearest neighbor results, we evaluate our features by clustering the ImageNet dataset. The goal is to cluster our self-supervised features (trained on unlabeled ImageNet) with k-means, map each cluster to an ImageNet category, and then evaluate on the ImageNet validation set. To map clusters to categories, we first calculate the similarity between all (cluster, category) pairs as the number of common images divided by the size of the cluster. Then, we find the best mapping between clusters and categories using the Hungarian algorithm. Below, we show randomly selected images (columns) from randomly selected clusters (rows) for our best AlexNet model. This is done with no manual inspection or cherry-picking. Note that most rows are aligned with semantic categories.
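The cluster-to-category alignment step described above can be sketched with SciPy's Hungarian solver. The sketch assumes the cluster assignments already come from k-means on the features; it only handles the similarity table and the one-to-one mapping.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_alignment_accuracy(cluster_ids, labels, n):
    """Map each of n clusters to a unique category with the Hungarian algorithm,
    using (common images / cluster size) as the similarity, then score Top-1
    accuracy under that mapping."""
    # Contingency table: overlap[i, j] = number of images in cluster i with label j
    overlap = np.zeros((n, n), dtype=np.int64)
    for c, y in zip(cluster_ids, labels):
        overlap[c, y] += 1
    # Similarity = common images divided by cluster size
    sim = overlap / np.maximum(overlap.sum(axis=1, keepdims=True), 1)
    # linear_sum_assignment minimizes cost, so negate to maximize similarity
    rows, cols = linear_sum_assignment(-sim)
    mapping = dict(zip(rows, cols))                  # cluster -> category
    preds = np.array([mapping[c] for c in cluster_ids])
    return float(np.mean(preds == labels))
```

The Hungarian algorithm guarantees the globally best one-to-one assignment, so no category is claimed by two clusters, which a greedy per-cluster mapping cannot ensure.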