{"title": "Deep Representations and Codes for Image Auto-Annotation", "book": "Advances in Neural Information Processing Systems", "page_first": 908, "page_last": 916, "abstract": "The task of assigning a set of relevant tags to an image is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of full sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we outperform or compete with existing annotation approaches that use over a dozen distinct image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. In our experiments, using deeper architectures always outperform shallow ones.", "full_text": "Deep Representations and Codes for Image\n\nAuto-Annotation\n\nRyan Kiros\n\nCsaba Szepesv\u00b4ari\n\nDepartment of Computing Science\n\nDepartment of Computing Science\n\nUniversity of Alberta\nEdmonton, AB, Canada\n\nrkiros@ualberta.ca\n\nUniversity of Alberta\nEdmonton, AB, Canada\n\nszepesva@ualberta.ca\n\nAbstract\n\nThe task of image auto-annotation, namely assigning a set of relevant tags to an\nimage, is challenging due to the size and variability of tag vocabularies. Conse-\nquently, most existing algorithms focus on tag assignment and \ufb01x an often large\nnumber of hand-crafted features to describe image characteristics. 
In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones.\n\n1 Introduction\n\nThe development of successful methods for training deep architectures has influenced the development of representation learning algorithms that operate either on top of SIFT descriptors [1, 2] or on raw pixel input [3, 4, 5] for feature extraction from full-sized images. Algorithms for pixel-based representation learning avoid the use of any hand-crafted features, removing the difficulty of deciding which features are better suited for the desired task. Furthermore, self-taught learning [6] can be employed, taking advantage of feature learning from image databases independent of the target dataset.\nImage auto-annotation is a multi-label classification task of assigning a set of relevant, descriptive tags to an image, where tags often come from a vocabulary of hundreds to thousands of words. Figure 1 illustrates this task. Auto-annotation is a difficult problem due to the high variability of tags. Tags may describe objects, colors, scenes, local regions of the image (e.g. a building) or global characteristics (e.g. whether the image is outdoors). 
Consequently, many of the most successful annotation algorithms in the literature [7, 8, 9, 10, 11] have opted to focus on tag assignment and often fix a large number of hand-crafted features as input to their algorithms. The task of feature selection and applicability was studied by Zhang et al. [12], who utilized a group sparsity approach for dropping features. Furthermore, they observed that feature importance varied across datasets and that some features led to redundancy, such as RGB and HSV histograms. Our main contribution in this paper is to remove the need to compute over a dozen hand-crafted features for annotating images and consequently to remove the need for feature selection. We introduce a deep learning algorithm for learning hierarchical representations of full-sized color images from the pixel level, which may be seen as a generalization of the approach by Coates et al. [13] to larger images and more layers. We first benchmark our algorithm on the STL-10 recognition dataset, achieving a classification accuracy of 62.1%. For annotation, we use the TagProp discriminative metric learning algorithm [9], which has enjoyed state-of-the-art performance on popular annotation benchmarks. We test performance on three datasets: Natural Scenes, IAPRTC-12 and ESP-Game. When our features are combined with TagProp, we either compete with or outperform existing methods that use 15 distinct hand-crafted features and metrics. This gives the advantage of focusing new research on improving tag assignment algorithms without the need to decide which features are best suited for the task.\n\nFigure 1: Sample annotation results on IAPRTC-12 (top) and ESP-Game (bottom) using TagProp when each image is represented by a 256-bit code. The first column of tags is the gold standard and the second column shows the predicted tags. 
Predicted tags in italics are those that also appear in the gold standard.\n\nMore recently, auto-annotation algorithms have focused on scalability to large databases with hundreds of thousands to millions of images. Such approaches include that of Tsai et al. [10], who construct visual synsets of images, and Weston et al. [11], who use joint word-image embeddings. Our second contribution is to represent an image with a 256-bit code for annotation. Torralba et al. [14] performed an extensive analysis of small codes for image retrieval, showing that even on databases with millions of images, linear search with Hamming distance can be performed efficiently. We utilize an autoencoder with a single hidden layer on top of our learned hierarchical representations to construct codes. Experimental results show that only a small reduction in performance is incurred compared to the original learned features. In exchange, 256-bit codes are efficient to store and can be compared quickly with bitwise operations. To our knowledge, our approach is the first to learn binary codes from full-sized color images without the use of hand-crafted features. Existing approaches often compute an initial descriptor such as GIST for representing an image. These approaches introduce too strong a bottleneck too early; in our pipeline, the bottleneck comes only after multiple layers of representation learning.\n\n2 Hierarchical representation learning\n\nIn this section we describe our approach for learning a deep feature representation from the pixel level of a color image. Our approach involves the stages of a typical pipeline: pre-processing and whitening, dictionary learning, convolutional extraction and pooling. We define a module as a pass through each of the above operations. We first introduce our setup with high-level descriptions, followed by a more detailed description of each stage. 
Finally, we show how to stack multiple modules on top of each other.\nGiven a set of images, the learning phase of the network is as follows:\n\n1. Extract randomly selected patches from each image and apply pre-processing.\n2. Construct a dictionary using K-SVD.\n3. Convolve the dictionary with larger tiles extracted across the image with a pre-defined stride length. Re-assemble the outputs in a non-overlapping, spatially preserving grid.\n4. Pool over the reassembled features with a 2-layer pyramid.\n5. Repeat the above operations for as many modules as desired.\n\nFor extracting features of a new image, we perform steps (3) and (4) for each module.\n\n2.1 Patch extraction and pre-processing\n\nLet {I(1), . . . , I(m)} be a set of m input images. For simplicity of explanation, assume I(i) \u2208 R^(nV\u00d7nH\u00d73), i = 1 . . . m, though it need not be the case that all images are of the same size. Given a receptive field of size r\u00d7c, we first extract np patches of size r\u00d7c\u00d73 across all images, followed by flattening each patch into a column vector. Let X = {x(1), . . . , x(np)}, x(i) \u2208 R^n, i = 1 . . . np, n = 3rc denote the extracted patches. We first perform mean centering and unit variance scaling across features. This corresponds to local brightness and contrast normalization, respectively.\nNext we follow [13] by performing ZCA whitening, which results in patches having zero mean, (1/np) \u2211 x(i) = 0, and identity covariance, (1/np) \u2211 x(i)(x(i))^T = I. A whitening matrix is computed as W = V(Z + \u03b5I)^(-1/2) V^T, where C = VZV^T is an eigendecomposition of the centered covariance matrix C = C(X) produced by subtraction of the mean M = M(X). The parameter \u03b5 is a small positive number having the effect of a low-pass filter.\n\n2.2 Dictionary learning\n\nLet S = {s(1), . . . , s(np)} denote the whitened patches. We are now ready to construct a set of bases from S. 
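As a concrete illustration, the normalization and ZCA whitening of Section 2.1 can be sketched in a few lines of numpy (a minimal sketch under stated assumptions: the column-per-patch layout, the variable names and the default eps are illustrative, not the paper's exact settings):

```python
import numpy as np

def preprocess_and_whiten(X, eps=0.1):
    """Brightness/contrast normalization followed by ZCA whitening.

    X: (n, num_patches) array of flattened patches, one patch per column
       (n = 3 * r * c for r x c color patches). Layout and eps are
       illustrative assumptions, not the paper's exact settings.
    Returns the whitened patches S plus the mean M and whitening matrix W,
    which are reused when processing new tiles at extraction time.
    """
    # Local brightness and contrast normalization: zero mean and unit
    # variance across the features of each individual patch.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    # Center across the dataset, then eigendecompose the covariance C = V Z V^T.
    M = X.mean(axis=1, keepdims=True)
    Xc = X - M
    C = (Xc @ Xc.T) / Xc.shape[1]
    Z, V = np.linalg.eigh(C)

    # W = V (Z + eps I)^(-1/2) V^T; eps acts as a low-pass filter.
    W = V @ np.diag(1.0 / np.sqrt(Z + eps)) @ V.T
    return W @ Xc, M, W
```

With a small eps, the whitened patches have approximately zero mean and identity covariance, up to the one direction removed by per-patch mean centering and the low-pass effect of eps.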
We follow Bo et al. [5] and use K-SVD for learning a dictionary. K-SVD constructs a dictionary D \u2208 R^(n\u00d7k) and a sparse representation \u02c6S \u2208 R^(k\u00d7np) by solving the following optimization problem:\n\nminimize_{D, \u02c6S} ||S \u2212 D\u02c6S||^2_F subject to ||\u02c6s(i)||_0 \u2264 q \u2200i    (1)\n\nwhere k is the desired number of bases. Optimization is done using alternation. When D is fixed, the problem of obtaining \u02c6S can be decomposed into np subproblems of the form ||s(i) \u2212 D\u02c6s(i)||^2 subject to ||\u02c6s(i)||_0 \u2264 q, which can be solved approximately using batch orthogonal matching pursuit [15]. When \u02c6S is fixed, we update D by first expressing equation 1 in terms of a residual R(l):\n\n||S \u2212 D\u02c6S||^2_F = ||S \u2212 \u2211_{j\u2260l} d(j)\u02c6s(j)^T \u2212 d(l)\u02c6s(l)^T||^2_F = ||R(l) \u2212 d(l)\u02c6s(l)^T||^2_F    (2)\n\nwhere l \u2208 {1, . . . , k}. A solution for d(l), the l-th column of D, can be obtained through an SVD of R(l). For space considerations, we refer the reader to Rubinstein et al. [15] for more details.^1\n\n2.3 Convolutional feature extraction\n\nGiven an image I(i), we first partition the image into a set of tiles T(i) of size nt \u00d7 nt with a pre-defined stride length s between each tile. Each patch in tile T(i)_t is processed in the same way as before dictionary construction (mean centering, contrast normalization, whitening), for which the mean and whitening matrices M and W are used. Let T(i)_tj denote the t-th tile and j-th channel with respect to image I(i), and let D(l)_j \u2208 R^(r\u00d7c) denote the l-th basis for channel j of D. The encoding f(i)_tl for tile t and basis l is given by:\n\nf(i)_tl = max{ tanh( \u2211_{j=1}^{3} T(i)_tj * D(l)_j ), 0 }    (3)\n\nwhere * denotes convolution and the max and tanh operations are applied componentwise. Even though it is not the encoding associated with K-SVD, this type of 'surrogate coding' was studied by Coates et al. [13]. Let f(i)_t denote the concatenated encodings over bases, which have a resulting dimension of (nt \u2212 r + 1) \u00d7 (nt \u2212 c + 1) \u00d7 k. These are then re-assembled into spatially preserving, non-overlapping regions. See figure 2 for an illustration. We perform one additional localized contrast normalization over f(i)_t of the form f(i)_t \u2190 (f(i)_t \u2212 \u00b5(f(i)_t)) / max{\u00b5(\u03c3_t), \u03c3(i)_t}. Similar types of normalization have been shown to be critical for performance by Ranzato et al. [16] and Bo et al. [5].\n\n^1 We use Rubinstein's implementation available at http://www.cs.technion.ac.il/~ronrubin/software.html\n\nFigure 2: Left: D is convolved with each tile (large green square) with receptive field (small blue square) over a given stride. The outputs are re-assembled in non-overlapping regions preserving spatial structure. Right: 2 \u00d7 2 and 1 \u00d7 1 regions are summed (pooled) along each cross section.\n\n2.4 Pooling\n\nThe final step of our pipeline is to perform spatial pooling over the re-assembled regions of the encodings f(i)_t. Consider the l-th cross section corresponding to the l-th dictionary element, l \u2208 {1, . . . , k}. We may then pool over each of the spatial regions of this cross section by summing over the activations of the corresponding spatial regions. 
This is done in the form of a 2-layer spatial pyramid, where the base of the pyramid consists of 4 blocks in a 2\u00d72 tiling and the top of the pyramid consists of a single block across the whole cross section. See figure 2 for an illustration.\nOnce pooling is performed, the re-assembled encodings result in shapes of size 1 \u00d7 1 \u00d7 k and 2 \u00d7 2 \u00d7 k from the two layers of the pyramid. To obtain the final feature vector, each layer is flattened into a vector and the resulting vectors are concatenated into a single long feature vector of dimension 5k for each image I(i). Prior to classification, these features are normalized to have zero mean and unit variance.\n\n2.5 Training multiple modules\n\nWhat we have described up until now is how to extract features using a single module corresponding to dictionary learning, extraction and pooling. We can now extend this framework into a deep network by stacking multiple modules. Once the first module has been trained, we can take the pooled features to be the input to a second module. Freezing the learned dictionary from the first module, we can then apply all the same steps a second time to the pooled representations. This type of stacked training can be performed for as many modules as desired.\nTo be more specific about the input to the second module, we use an additional spatial pooling operation on the re-assembled encodings of the first module, where we extract 256 blocks in a 16 \u00d7 16 tiling, resulting in a representation of size 16\u00d716\u00d7k. It is these inputs which we then pass on to the second module. We choose 16\u00d716 as a trade-off between aggregating too much information and increasing memory and time complexity. As an illustration, the same operations for the second module are used as in figure 2, except the image is replaced with the 16 \u00d7 16 \u00d7 k pooled features. 
In the next module, the number of channels is equal to the number of bases from the previous module.\n\n3 Code construction and discriminative metric learning\n\nIn this section we first show how to learn binary codes from our learned features, followed by a review of the TagProp algorithm [9] used for annotation.\n\n3.1 Learning binary codes for annotation\n\nOur codes are learned by adding an autoencoder with a single hidden layer on top of the learned output representations. Let f(i) \u2208 R^dm denote the learned representation for image I(i) of dimension dm using either a one or two module architecture. The code b(i) for f(i) is computed by b(i) = round(\u03c3(f(i))), where \u03c3(f(i)) = (1 + exp(W f(i) + \u03b2))^\u22121, W \u2208 R^(db\u00d7dm), \u03b2 \u2208 R^db and db is the number of bits (in our case, db = 256). Using a linear output layer, our objective is to minimize the mean squared error of the reconstructions of the inputs, given by (1/m) \u2211_i [ (\u02dcW\u03c3(f(i)) + \u02dc\u03b2) \u2212 f(i) ]^2, where \u02dcW \u2208 R^(dm\u00d7db), \u02dc\u03b2 \u2208 R^dm are the second layer weights and biases, respectively. The objective is minimized using standard backpropagation.\nAs is, the optimization does not take into consideration the rounding used in the coding layer, and consequently the output is not adapted for this operation. We follow Salakhutdinov et al. [17] and use additive 'deterministic' Gaussian noise with zero mean in the coding layer that is fixed in advance for each datapoint when performing a bottom-up pass through the network. Using unit variance was sufficient to force almost all the activations near {0, 1}. We tried other approaches, including simple thresholding, but found the Gaussian noise to be most successful without interfering with the optimization. 
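A minimal sketch of the coding layer just described (illustrative assumptions: W and beta stand in for the trained autoencoder parameters, the standard logistic (1 + exp(-z))^-1 from Section 3.2 is used, and the linear reconstruction layer and backpropagation are omitted):

```python
import numpy as np

def binary_codes(F, W, beta, noise=None):
    """Compute db-bit codes b = round(sigma(W f + beta)) for a batch.

    F: (m, dm) learned image representations, one per row.
    W: (db, dm) coding-layer weights, beta: (db,) biases; both are
       illustrative stand-ins for parameters learned by the autoencoder.
    noise: optional (m, db) zero-mean, unit-variance Gaussian noise,
       fixed in advance per datapoint and added to the coding-layer
       inputs during training to push activations toward {0, 1};
       omit it when extracting codes for a trained model.
    """
    Z = F @ W.T + beta
    if noise is not None:
        Z = Z + noise
    A = 1.0 / (1.0 + np.exp(-Z))      # logistic activations
    return np.round(A).astype(np.uint8)
```

With db = 256 each code packs into 32 bytes, which is what makes storage and comparison cheap at annotation time.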
Figure 3 shows the coding layer activation values after backpropagation when noise has been added.\n\nFigure 3: Coding layer activation values after training the autoencoder.\n\n3.2 The tag propagation (TagProp) algorithm\n\nLet V denote a fixed vocabulary of tags and I denote a list of input images. Our goal at test time, given a new input image i', is to assign the set of tags v \u2208 V that are most relevant to the content of i'. TagProp operates on pairwise distances to learn a conditional distribution of words given images. More specifically, let y_iw \u2208 {1, \u22121}, i \u2208 I, w \u2208 V be an indicator for whether tag w is present in image i. In TagProp, the probability that y_iw = 1 is given by \u03c3(\u03b1_w x_iw + \u03b2_w), x_iw = \u2211_j \u03c0_ij y_jw, where \u03c3(z) = (1 + exp(\u2212z))^\u22121 is the logistic function, (\u03b1_w, \u03b2_w) are word-specific model parameters to be estimated and \u03c0_ij are distance-based weights, also to be estimated. The weights \u03c0_ij are expressed as\n\n\u03c0_ij = exp(\u2212d_h(i, j)) / \u2211_{j'} exp(\u2212d_h(i, j')),    d_h(i, j) = h d_ij,    h \u2265 0    (4)\n\nwhere we shall call d_ij the base distance between images i and j. Let \u03b8 = {\u03b1_w \u2200w \u2208 V, \u03b2_w \u2200w \u2208 V, h} denote the list of model parameters. The model is trained to maximize the quasi-likelihood of the data, given by L = \u2211_{i,w} c_iw log p(y_iw), with c_iw = 1/n+ if y_iw = 1 and c_iw = 1/n\u2212 otherwise, where n+ is the total number of positive labels of w and likewise for n\u2212 and missing labels. This weighting allows us to take into account imbalances between label presence and absence. Combined with the logistic word models, it accounts for much higher recall on rare tags, which would normally be less likely to be recalled in a basic k-NN setup. 
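As an illustration, TagProp's prediction rule and the Hamming base distance used for the 256-bit codes can be sketched as follows (a hypothetical, minimal implementation; the function names and the packed-code layout are our own, and the parameters would come from the training procedure described in the text):

```python
import numpy as np

def tagprop_predict(d, Y, alpha, beta, h):
    """p(y_w = 1) for one test image under TagProp's logistic word models.

    d: (K,) base distances to the K nearest training neighbors.
    Y: (K, |V|) tag indicators of those neighbors, +1 present / -1 absent.
    alpha, beta: (|V|,) per-word parameters; h >= 0 scales distances.
    All names are illustrative; alpha, beta and h are learned by
    maximizing the weighted quasi-likelihood described in the text.
    """
    logits = -h * d
    e = np.exp(logits - logits.max())   # numerically stable softmax, eq. (4)
    pi = e / e.sum()                    # neighbor weights pi_ij
    x = pi @ Y                          # x_iw = sum_j pi_ij * y_jw
    return 1.0 / (1.0 + np.exp(-(alpha * x + beta)))

def hamming_distance(a, b):
    """Hamming distance between codes packed into uint64 words
    (a 256-bit code is four uint64s), using only bitwise operations."""
    x = np.bitwise_xor(a, b)
    return int(np.unpackbits(x.view(np.uint8)).sum())
```

Packing a 256-bit code into four uint64 words keeps storage at 32 bytes per image and lets the base distance be computed with XOR and popcount alone, which is what makes linear search over large databases cheap.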
Optimization of L is performed using a projected gradient method to enforce the non-negativity constraint on h.\nThe choice of base distance used depends on the image representation. In the above description, the model was derived assuming only a single base distance is computed between images. This can be generalized to an arbitrary number of distances by letting h be a parameter vector and letting d_h(i, j) be a combination of base distances weighted by h. Under this formulation, multiple descriptors of images can be computed and weighted. The best performance of TagProp [9] was indeed obtained using this multiple metric formulation in combination with the logistic word models. In our case, Euclidean distance is used for real-valued features and Hamming distance for binary codes. Furthermore, we only consider pairwise distances from the K nearest neighbors, where K is chosen through cross validation.\n\n4 Experiments\n\nWe perform experimental evaluation of our methods on 4 datasets: one dataset, STL-10, for object recognition to benchmark our hierarchical model, and three datasets for annotation: Natural Scenes, IAPRTC-12 and ESP-Game.^2\nFor all our experiments, we use k1 = 512 first module bases, k2 = 1024 second module bases, receptive field sizes of 6 \u00d7 6 and 2 \u00d7 2 and tile sizes (nt) of 16 \u00d7 16 and 6 \u00d7 6. The total number of features for the combined first and second module representation is thus 5(k1 + k2) = 7680. Images are resized such that the longest side is no larger than 300 pixels with preserved aspect ratio. The first module stride length is chosen based on the length of the longest side of the image: 4 if the side is less than 128 pixels, 6 if less than 214 pixels and 8 otherwise. The second module stride length is fixed at 2. For training the autoencoder, we use 10 epochs (passes over the training set) with minibatches of size no larger than 1000. 
Optimization is done using Polak-Ribi\u00e8re conjugate gradients with 3 linesearches per minibatch.^3\nWe also incorporate the use of self-taught learning [6] in our annotation experiments by utilizing the Mirflickr dataset for dictionary learning. Mirflickr is a collection of 25000 images taken from Flickr and deemed to have a high interestingness rating. We randomly sampled 10000 images from this dataset for training K-SVD on both modules. All reported results for Natural Scenes, IAPRTC-12 and ESP-Game use self-taught learning. Our code for feature learning will be made available online.\n\n4.1 STL-10\n\nThe STL-10 dataset is a collection of 96\u00d796 images of 10 classes, with images partitioned into 10 folds of 1000 images each and a test set of size 8000. Alongside these labeled images is a set of 100000 unlabeled images that may or may not come from the same distribution as the training data. The evaluation procedure is to perform representation learning on the unlabeled data and apply the representations to the training set, averaging test errors across all folds. We randomly chose 10000 images from the unlabeled set for training and use a linear L2-SVM for classification with 5-fold cross validation for model selection.\n\nTable 1: A selection of the best results obtained on the STL-10 dataset.\nMethod | Accuracy\nSparse filtering [18] | 53.5%\nOMP, k = 1600 [13] | 54.9%\nOMP, SC encoder, k = 1600 [13] | 59.0%\nReceptive field learning, 3 modules [19] | 60.1%\nVideo unsup features [20] | 61.0%\nHierarchical matching pursuit [21] | 64.5%\n1st Module | 56.4%\n1st + 2nd Module | 62.1%\n\nTable 1 shows our results on STL-10. Our 2 module architecture outperforms all existing approaches except for the recently proposed hierarchical matching pursuit (HMP). HMP uses joint layerwise pooling and separate training for RGB and grayscale dictionaries, approaches which may also be adapted to our method. Moreover, we hypothesize that further improvements can be made when the receptive field learning strategies of Coates et al. [19] and Jia et al. [22] are incorporated into a third module.\n\n4.2 Natural scenes\n\nThe Natural Scenes dataset is a multi-label collection of 2000 images from 5 classes: desert, forest, mountain, ocean and sunset. We follow standard protocol and report the average results of 5 metrics using 10-fold cross validation: Hamming loss (HL), one error (OE), coverage (C), ranking loss (RL) and average precision (AP). For space considerations, these metrics are defined in the appendix. To perform model selection with TagProp, we perform 5-fold cross validation within each of the 10 folds to determine the value of K which minimizes Hamming loss.\n\n^2 Tags for IAPRTC-12 and ESP-Game as well as the features used by existing approaches can be found at http://lear.inrialpes.fr/people/guillaumin/data.php\n^3 Rasmussen's minimize routine is used.\n\nTable 2: A selection of the best results obtained on the Natural Scenes dataset. 
Arrows indicate the direction of improved performance.\nMethod | HL \u2193 | OE \u2193 | C \u2193 | RL \u2193 | AP \u2191\nML-KNN [23] | 0.169 | 0.300 | 0.939 | 0.168 | 0.803\nML-I2C [24] | 0.159 | 0.311 | 0.883 | 0.156 | 0.804\nInsDif [25] | 0.152 | 0.259 | 0.834 | 0.140 | 0.830\nML-LI2C [24] | 0.129 | 0.190 | 0.624 | 0.091 | 0.881\n1st Module | 0.113 | 0.170 | 0.580 | 0.080 | 0.895\n1st Module, 256-bit | 0.113 | 0.169 | 0.585 | 0.082 | 0.894\n1st + 2nd Module | 0.100 | 0.140 | 0.554 | 0.074 | 0.910\n1st + 2nd Module, 256-bit | 0.106 | 0.155 | 0.558 | 0.075 | 0.903\n\nTable 3: A selection of the best results obtained on the IAPRTC-12 (left) and ESP-Game (right) datasets.\nMethod | P | R | N+ | P | R | N+\nMBRM [26] | 0.24 | 0.23 | 223 | 0.18 | 0.19 | 209\nLASSO [7] | 0.28 | 0.29 | 246 | 0.21 | 0.24 | 224\nJEC [7] | 0.28 | 0.29 | 250 | 0.22 | 0.25 | 224\nGS [12] | 0.32 | 0.29 | 252 | - | - | -\nCCD [8] | 0.44 | 0.29 | 251 | 0.36 | 0.24 | 232\nTagProp (\u03c3 SD) [9] | 0.41 | 0.30 | 259 | 0.39 | 0.24 | 232\nTagProp (\u03c3 ML) [9] | 0.46 | 0.35 | 266 | 0.39 | 0.27 | 239\n1st Module | 0.37 | 0.25 | 241 | 0.37 | 0.20 | 231\n1st Module, 256-bit | 0.34 | 0.22 | 236 | 0.35 | 0.20 | 231\n1st + 2nd Module | 0.42 | 0.29 | 252 | 0.38 | 0.22 | 228\n1st + 2nd Module, 256-bit | 0.36 | 0.25 | 244 | 0.37 | 0.23 | 236\n\nTable 2 shows the results of our method. In all five measures we obtain an improvement over previous methods. Furthermore, using 256-bit codes offers near equivalent performance. As in the case of STL-10, improvements are made over a single module.\n\n4.3 IAPRTC-12 and ESP-Game\n\nIAPRTC-12 is a collection of 20000 images with a vocabulary size of |V| = 291 and an average of 5.7 tags per image. ESP-Game is a collection of 60000 images with |V| = 268 and an average of 4.7 tags per image. Following Guillaumin et al. [9], we apply experiments to a pre-defined subset of 20000 images. Using standard protocol, performance is evaluated using 3 measures: precision (P), recall (R) and the number of recalled tags (N+). 
N+ indicates the number of tags that were recalled at least once for annotation on the test set. Annotations are made by choosing the 5 most probable tags for each image, as is done in previous evaluations. As with the Natural Scenes dataset, we perform 5-fold cross validation to determine K for training TagProp.\nTable 3 shows our results with IAPRTC-12 on the left and ESP-Game on the right. Our results give comparable performance to CCD and the single distance (SD) variation of TagProp. Unfortunately, we are unable to match the recall values obtained with the multiple metric (ML) variation of TagProp. Importantly, we outperform GS, which specifically studied the use of feature selection. Our 256-bit codes suffer a loss of performance on IAPRTC-12 but give near equivalent results on ESP-Game. We note again that our features were learned on an entirely different dataset (Mirflickr) in order to show their generalization capabilities.\nFinally, we perform two qualitative experiments. Figure 4 shows sample unsupervised retrieval results using the learned 256-bit codes on IAPRTC-12 and ESP-Game, while figure 5 illustrates sample annotation performance when training on one dataset and annotating the other. These results show that our codes are able to capture high-level semantic concepts that perform well for retrieval and transfer learning across datasets. We note, however, that annotating ESP-Game when training was done on IAPRTC-12 led to more false 'human' annotations (such as the bottom-right image in figure 5). We hypothesize that this is due to a larger proportion of persons in the IAPRTC-12 training set.\n\nFigure 4: Sample 256-bit unsupervised retrieval results on ESP-Game (top) and IAPRTC-12 (bottom). A query image from the test set is used to retrieve the four nearest neighbors from the training set.\n\nFigure 5: Sample 256-bit annotation results when training on one dataset and annotating the other. Top: Training on ESP-Game, annotation on IAPRTC-12. Bottom: Training on IAPRTC-12, annotation on ESP-Game.\n\n5 Conclusion\n\nIn this paper we introduced a hierarchical model for learning feature representations of standard sized color images for the task of image annotation. Our results compare favorably to existing approaches that use over a dozen handcrafted image descriptors.\nOur primary goal for future work is to test the effectiveness of this approach on web-scale annotation systems with millions of images. The success of self-taught learning in this setting means only one dictionary per module ever needs to be learned. Furthermore, our features can be used in combination with any nearest neighbor based algorithm for annotation. It is our hope that the successful use of binary codes for annotation will allow further research to bridge the gap between the annotation algorithms used on small scale problems and those required for web scale tasks. We also intend to evaluate the effectiveness of semantic hashing on large databases when much smaller codes are used. Krizhevsky et al. [27] evaluated semantic hashing using very deep autoencoders on tiny (32 \u00d7 32) images. Future work also involves performing similar experiments on standard sized RGB images.\n\nAcknowledgments\n\nThe authors thank Axel Soto as well as the anonymous reviewers for helpful discussion and comments. This work was funded by NSERC and the Alberta Innovates Centre for Machine Learning.\n\nReferences\n[1] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794\u20131801, 2009.\n[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360\u20133367, 2010.\n[3] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. 
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 1\u20138, 2009.\n[4] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In CVPR, pages 1713\u20131720, 2011.\n[5] L. Bo, X. Ren, and D. Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In NIPS, 2011.\n[6] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning. In ICML, pages 759\u2013766, 2007.\n[7] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In ECCV, volume 8, pages 316\u2013329, 2008.\n[8] H. Nakayama. Linear Distance Metric Learning for Large-scale Generic Image Recognition. PhD thesis, The University of Tokyo.\n[9] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, pages 309\u2013316, 2009.\n[10] D. Tsai, Y. Jing, Y. Liu, H.A. Rowley, S. Ioffe, and J.M. Rehg. Large-scale image annotation using visual synset. In ICCV, pages 611\u2013618, 2011.\n[11] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21\u201335, 2010.\n[12] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D.N. Metaxas. Automatic image annotation using group sparsity. In CVPR, pages 3312\u20133319, 2010.\n[13] A. Coates and A.Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.\n[14] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, pages 1\u20138, 2008.\n[15] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical report, 2008.\n[16] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, pages 2146\u20132153, 2009.\n[17] G. Hinton and R. Salakhutdinov. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, 3(1):74\u201391, 2011.\n[18] J. Ngiam, P.W. Koh, Z. Chen, S. Bhaskar, and A.Y. Ng. Sparse filtering. In NIPS, 2011.\n[19] A. Coates and A.Y. Ng. Selecting receptive fields in deep networks. In NIPS, 2011.\n[20] W. Zou, A.Y. Ng, and K. Yu. Unsupervised learning of visual invariance with temporal coherence. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.\n[21] L. Bo, X. Ren, and D. Fox. Unsupervised feature learning for RGB-D based object recognition. In ISER, June 2012.\n[22] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012.\n[23] M.L. Zhang and Z.H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038\u20132048, 2007.\n[24] Z. Wang, Y. Hu, and L.T. Chia. Multi-label learning by image-to-class distance for scene classification and image annotation. In CIVR, pages 105\u2013112, 2010.\n[25] M.L. Zhang and Z.H. Zhou. Multi-label learning by instance differentiation. In AAAI, pages 669\u2013674, 2007.\n[26] S.L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In CVPR, pages 1002\u20131009, 2004.\n[27] A. Krizhevsky and G.E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.\n", "award": [], "sourceid": 424, "authors": [{"given_name": "Ryan", "family_name": "Kiros", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}