{"title": "Variational Autoencoder for Deep Learning of Images, Labels and Captions", "book": "Advances in Neural Information Processing Systems", "page_first": 2352, "page_last": 2360, "abstract": "A novel variational autoencoder is developed to model images, as well as associated labels or captions. The Deep Generative Deconvolutional Network (DGDN) is used as a decoder of the latent image features, and a deep Convolutional Neural Network (CNN) is used as an image encoder; the CNN is used to approximate a distribution for the latent DGDN features/code. The latent code is also linked to generative models for labels (Bayesian support vector machine) or captions (recurrent neural network). When predicting a label/caption for a new image at test, averaging is performed across the distribution of latent codes; this is computationally efficient as a consequence of the learned CNN-based encoder. Since the framework is capable of modeling the image in the presence/absence of associated labels/captions, a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.", "full_text": "Variational Autoencoder for Deep Learning\n\nof Images, Labels and Captions\n\nYunchen Pu\u2020, Zhe Gan\u2020, Ricardo Henao\u2020, Xin Yuan\u2021, Chunyuan Li\u2020, Andrew Stevens\u2020\n\n\u2020Department of Electrical and Computer Engineering, Duke University\n\n{yp42, zg27, r.henao, cl319, ajs104, lcarin}@duke.edu\n\nand Lawrence Carin\u2020\n\n\u2021Nokia Bell Labs, Murray Hill\nxyuan@bell-labs.com\n\nAbstract\n\nA novel variational autoencoder is developed to model images, as well as associated\nlabels or captions. The Deep Generative Deconvolutional Network (DGDN) is used\nas a decoder of the latent image features, and a deep Convolutional Neural Network\n(CNN) is used as an image encoder; the CNN is used to approximate a distribution\nfor the latent DGDN features/code. 
The latent code is also linked to generative\nmodels for labels (Bayesian support vector machine) or captions (recurrent neural\nnetwork). When predicting a label/caption for a new image at test, averaging is\nperformed across the distribution of latent codes; this is computationally ef\ufb01cient as\na consequence of the learned CNN-based encoder. Since the framework is capable\nof modeling the image in the presence/absence of associated labels/captions, a\nnew semi-supervised setting is manifested for CNN learning with images; the\nframework even allows unsupervised CNN learning, based on images alone.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNNs) [1] are effective tools for image analysis [2], with most CNNs\ntrained in a supervised manner [2, 3, 4]. In addition to being used in image classi\ufb01ers, image features\nlearned by a CNN have been used to develop models for image captions [5, 6, 7]. Most recent work\non image captioning employs a CNN for image encoding, with a recurrent neural network (RNN)\nemployed as a decoder of the CNN features, generating a caption.\nWhile large sets of labeled and captioned images have been assembled, in practice one typically\nencounters far more images without labels or captions. To leverage the vast quantity of these latter\nimages (and to tune a model to the speci\ufb01c unlabeled/uncaptioned images of interest at test), semi-\nsupervised learning of image features is of interest. To account for unlabeled/uncaptioned images,\nit is useful to employ a generative image model, such as the recently developed Deep Generative\nDeconvolutional Network (DGDN) [8, 9]. However, while the CNN is a feedforward model for image\nfeatures (and is therefore fast at test time), the original DGDN implementation required relatively\nexpensive inference of the latent image features. 
Specifically, in [8] parameter learning and inference are performed with Gibbs sampling or Monte Carlo Expectation-Maximization (MCEM).\nWe develop a new variational autoencoder (VAE) [10] setup to analyze images. The DGDN [8] is used as a decoder, and the encoder for the distribution of latent DGDN parameters is based on a CNN (termed a \u201crecognition model\u201d [10, 11]). Since a CNN is used within the recognition model, test-time speed is much faster than that achieved in [8]. The VAE framework manifests a novel means of semi-supervised CNN learning: a Bayesian SVM [12] leverages available image labels, the DGDN models the images (with or without labels), and the CNN manifests a fast encoder for the distribution of latent codes. For image-caption modeling, latent codes are shared between the CNN encoder, DGDN decoder, and RNN caption model; the VAE learns all model parameters jointly. These models are also applicable to images alone, yielding an unsupervised method for CNN learning.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nOur DGDN-CNN model for images is related to but distinct from prior convolutional variational auto-encoder networks [13, 14, 15]. In those models the pooling process in the encoder network is deterministic (max-pooling), as is the unpooling process in the decoder [14] (related to upsampling [13]). 
Our model uses stochastic unpooling, in which the unpooling map (upsampling) is inferred from the data, by maximizing a variational lower bound.\nSummarizing, the contributions of this paper include: (i) a new VAE-based method for deep deconvolutional learning, with a CNN employed within a recognition model (encoder) for the posterior distribution of the parameters of the image generative model (decoder); (ii) demonstration that the fast CNN-based encoder applied to the DGDN yields accuracy comparable to that provided by Gibbs sampling and MCEM based inference, while being much faster at test time; (iii) the first semi-supervised CNN classification results, applied to large-scale image datasets; and (iv) extensive experiments on image-caption modeling, in which we demonstrate the advantages of jointly learning the image features and caption model (we also present semi-supervised experiments for image captioning).\n\n2 Variational Autoencoder Image Model\n\n2.1 Image Decoder: Deep Deconvolutional Generative Model\nConsider N images {X^{(n)}}_{n=1}^{N}, with X^{(n)} \u2208 R^{N_x \u00d7 N_y \u00d7 N_c}; N_x and N_y represent the number of pixels in each spatial dimension, and N_c denotes the number of color bands in the image (N_c = 1 for gray-scale images and N_c = 3 for RGB images).\nTo introduce the image decoder (generative model) in its simplest form, we first consider a decoder with L = 2 layers. 
The code {S^{(n,k_2,2)}}_{k_2=1}^{K_2} feeds the decoder at the top (layer 2), and at the bottom (layer 1) the image X^{(n)} is generated:\n\nLayer 2:  \u02dcS^{(n,2)} = \u2211_{k_2=1}^{K_2} D^{(k_2,2)} \u2217 S^{(n,k_2,2)}    (1)\nUnpool:  S^{(n,1)} \u223c unpool(\u02dcS^{(n,2)})    (2)\nLayer 1:  \u02dcS^{(n,1)} = \u2211_{k_1=1}^{K_1} D^{(k_1,1)} \u2217 S^{(n,k_1,1)}    (3)\nData Generation:  X^{(n)} \u223c N(\u02dcS^{(n,1)}, \u03b1_0^{\u22121} I)    (4)\n\nEquation (4) is meant to indicate that E(X^{(n)}) = \u02dcS^{(n,1)}, and each element of X^{(n)} \u2212 E(X^{(n)}) is iid zero-mean Gaussian with precision \u03b1_0.\nConcerning notation, expressions with two superscripts, D^{(k_l,l)}, S^{(n,l)} and \u02dcS^{(n,l)}, for layer l \u2208 {1, 2} and image n \u2208 {1, . . . , N}, are 3D tensors. Expressions with three superscripts, S^{(n,k_l,l)}, are 2D activation maps, representing the k_l-th \u201cslice\u201d of 3D tensor S^{(n,l)}; S^{(n,k_l,l)} is the spatially-dependent activation map for image n, dictionary element k_l \u2208 {1, . . . , K_l}, at layer l of the model. Tensor S^{(n,l)} is formed by spatially aligning and \u201cstacking\u201d the {S^{(n,k_l,l)}}_{k_l=1}^{K_l}. Convolution D^{(k_l,l)} \u2217 S^{(n,k_l,l)} between 3D D^{(k_l,l)} and 2D S^{(n,k_l,l)} indicates that each of the K_{l\u22121} 2D \u201cslices\u201d of D^{(k_l,l)} is convolved with the spatially-dependent S^{(n,k_l,l)}; upon aligning and \u201cstacking\u201d these convolutions, a tensor output is manifested for D^{(k_l,l)} \u2217 S^{(n,k_l,l)} (that tensor has K_{l\u22121} 2D slices).\nAssume the dictionary elements {D^{(k_l,l)}} are known, along with the precision \u03b1_0; we now discuss the generative process of the decoder. The layer-2 activation maps {S^{(n,k_2,2)}}_{k_2=1}^{K_2} are the code that enters the decoder. Activation map S^{(n,k_2,2)} is spatially convolved with D^{(k_2,2)}, yielding a 3D tensor; summing over the K_2 such tensors manifested at layer 2 yields the pooled 3D tensor \u02dcS^{(n,2)}. Stochastic unpooling (discussed below) is employed to go from \u02dcS^{(n,2)} to S^{(n,1)}. Slice k_1 of S^{(n,1)}, S^{(n,k_1,1)}, is convolved with D^{(k_1,1)}, and summing over k_1 yields E(X^{(n)}).\nFor the stochastic unpooling, S^{(n,k_1,1)} is partitioned into contiguous p_x \u00d7 p_y pooling blocks (analogous to pooling blocks in CNN-based activation maps [1]). Let z_{i,j}^{(n,k_1,1)} \u2208 {0, 1}^{p_x p_y} be a vector of p_x p_y \u2212 1 zeros, and a single one; z_{i,j}^{(n,k_1,1)} corresponds to pooling block (i, j) in S^{(n,k_1,1)}. The location of the non-zero element of z_{i,j}^{(n,k_1,1)} identifies the location of the single non-zero element in the corresponding pooling block of S^{(n,k_1,1)}. The non-zero element in pooling block (i, j) of S^{(n,k_1,1)} is set to \u02dcS_{i,j}^{(n,k_1,2)}, i.e., element (i, j) in slice k_1 of \u02dcS^{(n,2)}. Within the prior of the decoder, we impose z_{i,j}^{(n,k_1,1)} \u223c Mult(1; 1/(p_x p_y), . . . , 1/(p_x p_y)). Both \u02dcS^{(n,2)} and S^{(n,1)} are 3D tensors with K_1 2D slices; as a result of the unpooling, the 2D slices in the sparse S^{(n,1)} have p_x p_y times more elements than the corresponding slices in the dense \u02dcS^{(n,2)}.\nThe above model may be replicated to constitute L > 2 layers. The decoder is represented concisely as p_\u03b1(X|s, z), where vector s denotes the \u201cunwrapped\u201d set of top-layer features {S^{(\u00b7,k_L,L)}}, and vector z denotes the unpooling maps at all L layers. The model parameters \u03b1 are the set of dictionary elements at the L layers, as well as the precision \u03b1_0. The prior over the code is p(s) = N(0, I).\n\n2.2 Image Encoder: Deep CNN\nTo make explicit the connection between the proposed CNN-based encoder and the above decoder, we also initially illustrate the encoder with an L = 2 layer model. While the two-layer decoder in (1)-(4) is top-down, starting at layer 2, the encoder is bottom-up, starting at layer 1 with image X^{(n)}:\n\nLayer 1:  \u02dcC^{(n,k_1,1)} = X^{(n)} \u2217_s F^{(k_1,1)}, k_1 = 1, . . . , K_1    (5)\nPool:  C^{(n,1)} \u223c pool(\u02dcC^{(n,1)})    (6)\nLayer 2:  \u02dcC^{(n,k_2,2)} = C^{(n,1)} \u2217_s F^{(k_2,2)}, k_2 = 1, . . . , K_2    (7)\nCode Generation:  s_n \u223c N(\u03bc_\u03c6(\u02dcC^{(n,2)}), diag(\u03c3_\u03c6^2(\u02dcC^{(n,2)})))    (8)\n\nImage X^{(n)} and filter F^{(k_1,1)} are each tensors, composed of N_c stacked 2D images (\u201cslices\u201d). To implement X^{(n)} \u2217_s F^{(k_1,1)}, the respective spatial slices of X^{(n)} and F^{(k_1,1)} are convolved; the results of the N_c convolutions are aligned spatially and summed, yielding a single 2D spatially-dependent filter output \u02dcC^{(n,k_1,1)} (hence the notation \u2217_s, to distinguish it from \u2217 in (1)-(4)).\nThe 2D maps {\u02dcC^{(n,k_1,1)}}_{k_1=1}^{K_1} are aligned spatially and \u201cstacked\u201d to constitute the 3D tensor \u02dcC^{(n,1)}. Each contiguous p_x \u00d7 p_y pooling region in \u02dcC^{(n,1)} is stochastically pooled to constitute C^{(n,1)}; the posterior pooling statistics in (6) are detailed below. Finally, the pooled tensor C^{(n,1)} is convolved with K_2 layer-2 filters {F^{(k_2,2)}}_{k_2=1}^{K_2}, each of which yields the 2D feature map \u02dcC^{(n,k_2,2)}; the K_2 feature maps {\u02dcC^{(n,k_2,2)}}_{k_2=1}^{K_2} are aligned and \u201cstacked\u201d to manifest \u02dcC^{(n,2)}.\nConcerning the pooling in (6), let \u02dcC_{i,j}^{(n,k_1,1)} reflect the p_x p_y components in pooling block (i, j) of \u02dcC^{(n,k_1,1)}. Using a multi-layered perceptron (MLP), this is mapped to the p_x p_y-dimensional real vector \u03b7_{i,j}^{(n,k_1,1)} = MLP(\u02dcC_{i,j}^{(n,k_1,1)}), defined as \u03b7_{i,j}^{(n,k_1,1)} = W_1 h, with h = tanh(W_2 vec(\u02dcC_{i,j}^{(n,k_1,1)})). The pooling vector is drawn z_{i,j}^{(n,k_1,1)} \u223c Mult(1; Softmax(\u03b7_{i,j}^{(n,k_1,1)})); as a recognition model, Mult(1; Softmax(\u03b7_{i,j}^{(n,k_1,1)})) is also treated as the posterior distribution for the DGDN unpooling in (2). Similarly, to constitute functions \u03bc_\u03c6(\u02dcC^{(n,2)}) and \u03c3_\u03c6^2(\u02dcC^{(n,2)}) in (8), each layer of \u02dcC^{(n,2)} is fed through a distinct MLP. Details are provided in the Supplementary Material (SM).\nParameters \u03c6 of q_\u03c6(s, z|X) correspond to the filter banks {F^{(k_l,l)}}, as well as the parameters of the MLPs. The encoder is a CNN (yielding fast testing), utilized in a novel manner to manifest a posterior distribution on the parameters of the decoder. As discussed in Section 4, the CNN is trained in a novel manner, allowing semi-supervised and even unsupervised CNN learning.\n\n3 Leveraging Labels and Captions\n\n3.1 Generative Model for Labels: Bayesian SVM\nAssume a label \u2113_n \u2208 {1, . . . , C} is associated with training image X^{(n)}; in the discussion that follows, labels are assumed available for each image (for notational simplicity), but in practice only a subset of the N training images need have labels. We design C one-versus-all binary SVM classifiers [16], responsible for mapping top-layer image features s_n to label \u2113_n; s_n is the same image code as in (8), from the top DGDN layer. For the \u2113-th classifier, with \u2113 \u2208 {1, . . . , C}, the problem may be posed as training with {s_n, y_n^{(\u2113)}}_{n=1}^{N}, with y_n^{(\u2113)} \u2208 {\u22121, 1}. If \u2113_n = \u2113 then y_n^{(\u2113)} = 1, and y_n^{(\u2113)} = \u22121 otherwise. Henceforth we consider the Bayesian SVM for each one of the binary learning tasks, with labeled data {s_n, y_n}_{n=1}^{N}.\nGiven a feature vector s, the goal of the SVM is to find an f(s) that minimizes the objective function \u03b3 \u2211_{n=1}^{N} max(1 \u2212 y_n f(s_n), 0) + R(f(s)), where max(1 \u2212 y_n f(s_n), 0) is the hinge loss, R(f(s)) is a regularization term that controls the complexity of f(s), and \u03b3 is a tuning parameter controlling the trade-off between error penalization and the complexity of the classification function. Recently, [12] showed that for the linear classifier f(s) = \u03b2^T s, minimizing the SVM objective function is equivalent to estimating the mode of the pseudo-posterior of \u03b2: p(\u03b2|S, y, \u03b3) \u221d \u220f_{n=1}^{N} L(y_n|s_n, \u03b2, \u03b3) p(\u03b2|\u00b7), where y = [y_1 . . . y_N]^T, S = [s_1 . . . s_N], L(y_n|s_n, \u03b2, \u03b3) is the pseudo-likelihood function, and p(\u03b2|\u00b7) is the prior distribution for the vector of coefficients \u03b2. In [12] it was shown that L(y_n|s_n, \u03b2, \u03b3) admits a location-scale mixture of normals representation by introducing latent variables \u03bb_n:\n\nL(y_n|s_n, \u03b2, \u03b3) = e^{\u22122\u03b3 max(1 \u2212 y_n \u03b2^T s_n, 0)} = \u222b_0^\u221e (\u221a\u03b3 / \u221a(2\u03c0\u03bb_n)) exp(\u2212(1 + \u03bb_n \u2212 y_n \u03b2^T s_n)^2 / (2\u03b3^{\u22121}\u03bb_n)) d\u03bb_n.    (9)\n\nNote that (9) is a mixture of Gaussian distributions w.r.t. random variable y_n \u03b2^T s_n, where the mixture is formed with respect to \u03bb_n, which controls the mean and variance of the Gaussians. This encourages data augmentation for variable \u03bb_n, permitting efficient Bayesian inference (see [12, 17] for details).\nParameters {\u03b2_\u2113}_{\u2113=1}^{C} for the C binary SVM classifiers are analogous to the fully connected parameters of a softmax classifier connected to the top of a traditional CNN [2]. 
If desired, the pseudo-likelihood of the SVM-based classifier can be replaced by a softmax-based likelihood. In Section 5 we compare performance of the SVM and softmax based classifiers.\n\n3.2 Generative Model for Captions\n\nFor image n, assume access to an associated caption Y^{(n)}; for notational simplicity, we again assume a caption is available for each training image, although in practice captions may only be available on a subset of images. The caption is represented as Y^{(n)} = (y_1^{(n)}, . . . , y_{T_n}^{(n)}), and y_t^{(n)} is a 1-of-V (\u201cone-hot\u201d) encoding, with V the size of the vocabulary, and T_n the length of the caption for image n. Word t, y_t^{(n)}, is embedded into an M-dimensional vector w_t^{(n)} = W_e y_t^{(n)}, where W_e \u2208 R^{M \u00d7 V} is a word embedding matrix (to be learned), i.e., w_t^{(n)} is a column of W_e, chosen by the one-hot y_t^{(n)}. The probability of caption Y^{(n)} given top-layer DGDN image features s_n is defined as p(Y^{(n)}|s_n) = p(y_1^{(n)}|s_n) \u220f_{t=2}^{T_n} p(y_t^{(n)}|y_{<t}^{(n)}, s_n). Specifically, we generate the first word y_1^{(n)} from s_n, with p(y_1^{(n)}) = softmax(V h_1^{(n)}), where h_1^{(n)} = tanh(C s_n). Bias terms are omitted for simplicity. All other words in the caption are then sequentially generated using a recurrent neural network (RNN), until the end-sentence symbol is generated. Each conditional p(y_t^{(n)}|y_{<t}^{(n)}, s_n) is specified as softmax(V h_t^{(n)}), where h_t^{(n)} is recursively updated through h_t^{(n)} = H(w_{t\u22121}^{(n)}, h_{t\u22121}^{(n)}). C and V are weight matrices (to be learned), and V is used for computing a distribution over words.\nThe transition function H(\u00b7) can be implemented with a gated activation function, such as Long Short-Term Memory (LSTM) [18] or a Gated Recurrent Unit (GRU) [19]. Both LSTM and GRU have been proposed to address the issue of learning long-term dependencies. In experiments we have found that GRU provides slightly better performance than LSTM (we implemented and tested both), and therefore the GRU is used.\n\n4 Variational Learning of Model Parameters\n\nTo make the following discussion concrete, we describe learning and inference within the context of images and captions, combining the models in Sections 2 and 3.2. This learning setup is also applied to model images with associated labels, with the caption model replaced in that case with the Bayesian SVM of Section 3.1 (details provided in the SM). In the subsequent discussion we employ the image encoder q_\u03c6(s, z|X), the image decoder p_\u03b1(X|s, z), and the generative model for the caption (denoted p_\u03c8(Y|s), where \u03c8 represents the GRU parameters).\nThe desired parameters {\u03c6, \u03b1, \u03c8} are optimized by maximizing the variational lower bound. For a single captioned image, the variational lower bound L_{\u03c6,\u03b1,\u03c8}(X, Y) can be expressed as\n\nL_{\u03c6,\u03b1,\u03c8}(X, Y) = \u03be {E_{q_\u03c6(s|X)}[log p_\u03c8(Y|s)]} + E_{q_\u03c6(s,z|X)}[log p_\u03b1(X, s, z) \u2212 log q_\u03c6(s, z|X)]\n\nwhere \u03be is a tuning parameter that balances the two components of L_{\u03c6,\u03b1,\u03c8}(X, Y). When \u03be is set to zero, it corresponds to the variational lower bound for a single uncaptioned image:\n\nU_{\u03c6,\u03b1}(X) = E_{q_\u03c6(s,z|X)}[log p_\u03b1(X, s, z) \u2212 log q_\u03c6(s, z|X)]    (10)\n\nThe lower bound for the entire dataset is then:\n\nJ_{\u03c6,\u03b1,\u03c8} = \u2211_{(X,Y) \u2208 D_c} L_{\u03c6,\u03b1,\u03c8}(X, Y) + \u2211_{X \u2208 D_u} U_{\u03c6,\u03b1}(X)    (11)\n\nwhere D_c denotes the set of training images with associated captions, and D_u is the set of training images that are uncaptioned (and unlabeled).\nTo optimize J_{\u03c6,\u03b1,\u03c8} w.r.t. 
\u03c6, \u03c8 and \u03b1, we utilize Monte Carlo integration to approximate the expectation, E_{q_\u03c6(s,z|X)}, and stochastic gradient descent (SGD) for parameter optimization. We use the variance reduction techniques in [10] and [11] to compute the gradients. Details are provided in the SM.\nWhen \u03be is set to 1, L_{\u03c6,\u03b1,\u03c8}(X, Y) recovers the exact variational lower bound. Motivated by assigning the same weight to every data point, we set \u03be = N_X/(T\u03c1) or N_X/(C\u03c1) in the experiments, where N_X is the number of pixels in each image, T is the number of words in the corresponding caption, C is the number of categories for the corresponding label and \u03c1 is the proportion of labeled/captioned data in the mini-batch.\nAt test time, we consider two tasks: inference of a caption or label for a new image X^\u22c6. Again, considering captioning of a new image (with similar inference for labeling), after the model parameters are learned, p(Y^\u22c6|X^\u22c6) = \u222b p_\u03c8(Y^\u22c6|s^\u22c6) p(s^\u22c6|X^\u22c6) ds^\u22c6 \u2248 (1/N_s) \u2211_{s=1}^{N_s} p_\u03c8(Y^\u22c6|s_s^\u22c6), where s_s^\u22c6 \u223c q_\u03c6(s|X = X^\u22c6), and N_s is the number of samples. Monte Carlo sampling is used to approximate the integral, and the recognition model, q_\u03c6(s|X), is employed to approximate p(s|X), for fast inference of image representation.\n\n5 Experiments\n\nThe architecture of models and initialization of model parameters are provided in the SM. No dataset-specific tuning other than early stopping on validation sets was conducted. The Adam algorithm [20] with learning rate 0.0002 is utilized for optimization of the variational learning expressions in Section 4. We use mini-batches of size 64. Gradients are clipped if the norm of the parameter vector exceeds 5, as suggested in [21]. 
All the experiments of our models are implemented in Theano [22]\nusing a NVIDIA GeForce GTX TITAN X GPU with 12GB memory.\n\n5.1 Benchmark Classi\ufb01cation\n\nWe \ufb01rst present image classi\ufb01cation results on MNIST, CIFAR-10 & -100 [23], Caltech 101 [24] &\n256 [25], and ImageNet 2012 datasets. For Caltech 101 and Caltech 256, we use 30 and 60 images\nper class for training, respectively. The predictions are based on averaging the decision values of\nNs = 50 collected samples from the approximate posterior distribution over the latent variables from\nq\u03c6(s|X). As a reference for computational cost, our model takes about 5 days to train on ImageNet.\nWe compared our VAE setup to a VAE with deterministic unpooling, and we also compare with\na DGDN trained using Gibbs sampling and MCEM [8]; classi\ufb01cation results and testing time are\nsummarized in Table 1. Other state-of-the-art results can be found in [8]. The results based on\nGibbs sampling and MCEM are obtained by our own implementation on the same GPU, which are\nconsistent with the classi\ufb01cation accuracies reported in [8].\nFor Gibbs-sampling-based learning, only suitable for the \ufb01rst \ufb01ve small/modest size datasets we\nconsider, we collect 50 posterior samples of model parameters \u03b1, after 1000 burn-in iterations during\ntraining. Given a sample of model parameters, the inference of top-layer features at test is also done\nvia Gibbs sampling. Speci\ufb01cally, we collect 100 samples after discarding 300 burn-in samples; fewer\nsamples leads to worse performance. 
The predictions are based on averaging the decision values of the collected samples (50 samples of model parameters \u03b1, and for each 100 inference samples of latent parameters s and z, for a total of 5000 samples). With respect to the testing of MCEM, all data-dependent latent variables are integrated (summed) out in the expectation, except for the top-layer feature map, for which we find a MAP point estimate via gradient descent.\n\nTable 1: Classification error (%) and testing time (ms per image) on benchmarks.\n\nMethod | MNIST error | MNIST time | CIFAR-10 error | CIFAR-10 time | CIFAR-100 error | CIFAR-100 time | Caltech 101 error | Caltech 101 time | Caltech 256 error | Caltech 256 time\nGibbs [8] | 0.37 | 3.1 | 8.21 | 10.4 | 34.33 | 10.4 | 12.87 | 50.4 | 29.50 | 52.3\nMCEM [8] | 0.45 | 0.8 | 9.04 | 1.1 | 35.92 | 1.1 | 13.51 | 8.8 | 30.13 | 8.9\nVAE-d | 0.42 | 0.007 | 10.74 | 0.02 | 37.96 | 0.02 | 14.79 | 0.3 | 32.18 | 0.3\nVAE (Ours) | 0.38 | 0.007 | 8.19 | 0.02 | 35.01 | 0.02 | 11.99 | 0.3 | 29.33 | 0.3\n\nMethod | ImageNet 2012 top-1 error | ImageNet 2012 top-5 error | ImageNet 2012 test time | ImageNet pretrained for Caltech 101: test error | test time | ImageNet pretrained for Caltech 256: test error | test time\nMCEM [8] | 37.9 | 16.1 | 14.4 | 6.85 | 14.1 | 22.10 | 14.2\nVAE (Ours) | 38.2 | 15.7 | 1.0 | 6.91 | 0.9 | 22.53 | 0.9\n\nAs summarized in Table 1, the proposed recognition model is much faster than Gibbs sampling and MCEM at test time (up to 400x speedup), and yields accuracy commensurate with these other two methods (often better). To illustrate the role of stochastic unpooling, we replaced it with deterministic unpooling as in [14]. The results, indicated as VAE-d in Table 1, demonstrate the powerful capabilities of the stochastic unpooling operation. 
We also tried VAE-d on the ImageNet 2012 dataset; however, the performance is much worse than our proposed VAE, hence those results are not reported.\n\n5.2 Semi-Supervised Classification\n\nWe now consider semi-supervised classification. With each mini-batch, we use 32 labeled samples and 32 unlabeled samples, i.e., \u03c1 = 0.5.\n\nTable 2: Semi-supervised classification error (%) on MNIST. N is the number of labeled images per class.\n\nN | TSVM | Deep generative model [26]: M1+TSVM | Deep generative model [26]: M1+M2 | Ladder network [27]: \u0393-full | Ladder network [27]: \u0393-conv | Our model: \u03be = 0 | Our model: \u03be = N_x/(C\u03c1)\n10 | 16.81 | 11.82 \u00b1 0.25 | 3.33 \u00b1 0.14 | 1.06 \u00b1 0.37 | 0.89 \u00b1 0.50 | 5.83 \u00b1 0.97 | 1.49 \u00b1 0.36\n60 | 6.16 | 5.72 \u00b1 0.05 | 2.59 \u00b1 0.05 | - | 0.82 \u00b1 0.17* | 2.19 \u00b1 0.19 | 0.77 \u00b1 0.09\n100 | 5.38 | 4.24 \u00b1 0.07 | 2.40 \u00b1 0.02 | 0.84 \u00b1 0.08 | 0.74 \u00b1 0.10* | 1.75 \u00b1 0.14 | 0.63 \u00b1 0.06\n300 | 3.45 | 3.49 \u00b1 0.04 | 2.18 \u00b1 0.04 | - | 0.63 \u00b1 0.02* | 1.42 \u00b1 0.08 | 0.51 \u00b1 0.04\n*These results are achieved with our own implementation based on the publicly available code.\n\nMNIST. We first test our model on the MNIST classification benchmark. We randomly split the 60,000 training samples into a 50,000-sample training set and a 10,000-sample validation set (used to evaluate early stopping). The training set is further randomly split into a labeled and unlabeled set, and the number of labeled samples in each category varies from 10 to 300. We perform testing on the standard 10,000 test samples with 20 different training-set splits.\nTable 2 shows the classification results. For \u03be = 0, the model is trained in an unsupervised manner. When doing unsupervised learning, the features extracted by our model are sent to a separate transductive SVM (TSVM). 
In this case, our results can be directly compared to the results of the M1+TSVM model [26], demonstrating the effectiveness of our recognition model in providing good representations of images. Using 10 labeled images per class, our semi-supervised learning approach with \u03be = N_x/(C\u03c1) achieves a test error of 1.49, which is competitive with state-of-the-art results [27]. When using a larger number of labeled images, our model consistently achieves the best results.\n\nImageNet 2012. ImageNet 2012 is used to assess the scalability of our model to large datasets (also considered, for supervised learning, in Table 1). Since no comparative results exist for semi-supervised learning with ImageNet, we implemented the 8-layer AlexNet [2] and the 22-layer GoogLeNet [4] as the supervised model baselines, which were trained by utilizing only the labeled data\u00b9. We split the 1.3M training images into a labeled and unlabeled set, and vary the proportion of labeled images from 1% to 100%. The classes are balanced to ensure that no particular class is over-represented, i.e., the ratio of labeled and unlabeled images is the same for each class. We repeat the training process 10 times, and each time we utilize different sets of images as the unlabeled ones.\n\n\u00b9 We use the default settings in the Caffe package, which provide a top-1 accuracy of 57.1% and 68.7%, as well as a top-5 accuracy of 80.2% and 88.9% on the validation set for AlexNet and GoogLeNet, respectively.\n\nFigure 1 shows our results, together with the baselines. Tabulated results and a plot with error bars are provided in the SM. The variance of our model\u2019s results (caused by different randomly selected labeled examples) is around 1% when considering a small proportion of labeled images (less than 10% labels), and the variance drops to less than 0.2% when the proportion of labeled images is larger than 30%. 
As can be seen from Figure 1, our semi-supervised learning approach with 60% labeled data achieves results (61.24% top-1 accuracy) comparable to those obtained with the full labeled dataset (61.8% top-1 accuracy), demonstrating the effectiveness of our approach for semi-supervised classification. Our model provides consistently better results than AlexNet [2], which has a five-convolutional-layer architecture similar to ours. Our model is outperformed by GoogLeNet when more labeled images are provided. This is not surprising since GoogLeNet utilizes a considerably more complicated CNN architecture than ours.\nTo further illustrate the role of each component of our model, we replaced the Bayesian SVM with a softmax classifier (see discussion at the end of Section 3.1). The softmax results are slightly worse, and are provided in the SM. The gap between the results of the Bayesian SVM and softmax is around 1% when the proportion of labeled images is less than 30%, and drops to around 0.5% when a larger proportion of labeled images is considered (larger than 30%). This further illustrates that the performance gain is primarily due to the semi-supervised learning framework used in our model, rather than the discriminative power of the SVM.\n\nFigure 1: Semi-supervised classification accuracy on the validation set of ImageNet 2012.\n\n5.3 Image Captioning\nWe present image captioning results on three benchmark datasets: Flickr8k [29], Flickr30k [30] and Microsoft (MS) COCO [31]. These datasets contain 8000, 31000 and 123287 images, respectively. Each image is annotated with 5 sentences. For fair comparison, we use the same pre-defined splits for all the datasets as in [5]. We use 1000 images for validation, 1000 for test and the rest for training on Flickr8k and Flickr30k. 
For MS COCO, 5000 images are used for both validation and testing.\nThe widely used BLEU metric [32] and sentence perplexity (PPL) are employed to quantitatively\nevaluate the performance of our image captioning model. A low PPL indicates a better language\nmodel. For the MS COCO dataset, we further evaluate our model with metrics METEOR [33] and\nCIDEr [34]. Our joint model takes three days to train on MS COCO.\nWe show results for three models: (i) Two-step model: this model consists of our generative and\nrecognition model developed in Section 2 to analyze images alone, in an unsupervised manner. The\nextracted image features are fed to a separately trained RNN. (ii) Joint model: this is the joint model\ndeveloped in Sections 2 and 3.2. (iii) Joint model with ImageNet: in this model training is performed\nin a semi-supervised manner, with the training set of ImageNet 2012 treated as uncaptioned images,\nto complement the captioned training set.\nThe image captioning results are summarized in Table 3. Our two-step model achieves better\nperformance than similar baseline two-step methods, in which VggNet [3] and GoogLeNet [4] were\nused as feature extractors. The baseline VggNet and GoogLeNet models require labeled images for\ntraining, and hence are trained on ImageNet. 
By contrast, in our two-step approach, the deep model\nis trained in an unsupervised manner, using uncaptioned versions of images from the training set.\nThis fact may explain the improved quality of our results in Table 3.\n\n[Figure: accuracy (%) versus proportion (%) of labeled images, with Top-1 and Top-5 curves for AlexNet, GoogLeNet and ours.]\n\nTable 3: BLEU-1,2,3,4, METEOR, CIDEr and PPL metrics compared to other state-of-the-art\nresults and baselines on Flickr8k, Flickr30k and MS COCO datasets.\n\nFlickr8k and Flickr30k\nMethod | Flickr8k B-1 | B-2 | B-3 | B-4 | PPL | Flickr30k B-1 | B-2 | B-3 | B-4 | PPL\nBaseline results\nVggNet+RNN | 0.56 | 0.37 | 0.24 | 0.16 | 15.71 | 0.57 | 0.38 | 0.25 | 0.17 | 18.83\nGoogLeNet+RNN | 0.56 | 0.38 | 0.24 | 0.16 | 15.71 | 0.58 | 0.39 | 0.26 | 0.17 | 18.77\nOur two step model | 0.61 | 0.41 | 0.27 | 0.17 | 15.82 | 0.61 | 0.41 | 0.27 | 0.17 | 18.73\nOur results with other state-of-the-art results\nHard-Attention [6] | 0.67 | 0.46 | 0.31 | 0.21 | - | 0.67 | 0.44 | 0.30 | 0.20 | -\nOur joint model | 0.70 | 0.49 | 0.33 | 0.22 | 15.24 | 0.69 | 0.50 | 0.35 | 0.22 | 16.17\nOur joint model with ImageNet | 0.72 | 0.52 | 0.36 | 0.25 | 13.24 | 0.72 | 0.53 | 0.38 | 0.25 | 15.34\nState-of-the-art results using extra information\nAttributes-CNN+RNN [7] | 0.74 | 0.54 | 0.38 | 0.27 | 12.60 | 0.73 | 0.55 | 0.40 | 0.28 | 15.96\n\nMS COCO\nMethod | B-1 | B-2 | B-3 | B-4 | METEOR | CIDEr | PPL\nBaseline results\nVggNet+RNN | 0.61 | 0.42 | 0.28 | 0.19 | 0.19 | 0.56 | 13.16\nGoogLeNet+RNN | 0.60 | 0.40 | 0.26 | 0.17 | 0.19 | 0.55 | 14.01\nOur two step | 0.61 | 0.42 | 0.27 | 0.18 | 0.20 | 0.58 | 13.46\nOur results with other state-of-the-art results\nDMSM [28] | - | - | - | 0.26 | 0.24 | - | 18.10\nHard-Attention [6] | 0.72 | 0.50 | 0.36 | 0.25 | 0.23 | - | -\nOur joint model | 0.71 | 0.51 | 0.38 | 0.26 | 0.22 | 0.89 | 11.57\nOur joint model with ImageNet | 0.72 | 0.52 | 0.37 | 0.28 | 0.24 | 0.90 | 11.14\nState-of-the-art results using extra information\nAttributes-CNN+LSTM [7] | 0.74 | 0.56 | 0.42 | 0.31 | 0.26 | 0.94 | 10.49\n\nIt is worth noting that our joint model yields significant improvements over our two-step model,\nnearly 10% on average for BLEU scores, demonstrating the importance of inferring a shared latent\nstructure. It can also be seen that our improvement with semi-supervised use of ImageNet is most\nsignificant with the small/modest datasets (Flickr8k and Flickr30k), compared to the large dataset\n(MS COCO). Our model performs better than most image captioning systems. The only method\nwith better performance than ours is [7], which employs an intermediate image-to-attributes layer\nthat requires determining an extra attribute vocabulary. Examples of captions generated for images in the\nvalidation set of ImageNet 2012, which has no ground truth captions and is unseen during training\n(the semi-supervised learning only uses the training set of ImageNet 2012), are shown in Figure 2.\n\nFigure 2: Examples of generated captions for unseen images from the validation set of ImageNet.\n\n6 Conclusions\n\nA recognition model has been developed for the Deep Generative Deconvolutional Network (DGDN)\n[8], based on a novel use of a deep CNN. The recognition model has been coupled with a Bayesian\nSVM and an RNN, to also model associated labels and captions, respectively. The model is learned\nusing a variational autoencoder setup, and allows semi-supervised learning (leveraging images without\nlabels or captions). The algorithm has been scaled up with a GPU-based implementation, achieving\nresults competitive with state-of-the-art methods on several tasks (and novel semi-supervised results).\n\nAcknowledgements\n\nThis research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF. 
The Titan X used\nin this work was donated by the NVIDIA Corporation.\n\n[Figure 2 generated captions: a man with a snowboard next to a man with glasses; a big black dog standing on the grass; a player is holding a hockey stick; a desk with a keyboard; a man is standing next to a brown horse; a box full of apples and oranges.]\n\nReferences\n\n[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.\n[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.\n[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.\n[6] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.\n[7] Q. Wu, C. Shen, A. Hengel, L. Liu, and A. Dick. What value do explicit high level concepts have in vision to language problems? In CVPR, 2016.\n[8] Y. Pu, X. Yuan, A. Stevens, C. Li, and L. Carin. A deep generative deconvolutional image model. In AISTATS, 2016.\n[9] Y. Pu, X. Yuan, and L. Carin. Generative deep deconvolutional learning. In ICLR workshop, 2015.\n[10] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.\n[11] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.\n[12] N. G. Polson and S. L. Scott. Data augmentation for support vector machines. Bayes. Anal., 2011.\n[13] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. 
Deep convolutional inverse graphics network. In NIPS, 2015.\n[14] A. Dosovitskiy, J. T. Springenberg, M. Tatarchenko, and T. Brox. Learning to generate chairs, tables and cars with convolutional networks. In CVPR, 2015.\n[15] C. Li, J. Zhu, T. Shi, and B. Zhang. Max-margin deep generative models. In NIPS, 2015.\n[16] V. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., 1995.\n[17] R. Henao, X. Yuan, and L. Carin. Bayesian nonlinear SVMs and factor modeling. In NIPS, 2014.\n[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.\n[19] K. Cho, B. V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.\n[20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n[21] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n[22] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. In NIPS Workshop, 2012.\n[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.\n[24] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 2007.\n[25] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.\n[26] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.\n[27] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.\n[28] H. Fang, S. Gupta, F. 
Iandola, R. Srivastava, L. Deng, P. Doll\u00e1r, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.\n[29] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 2013.\n[30] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014.\n[31] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.\n[32] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.\n[33] S. Banerjee and A. Lavie. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL workshop, 2005.\n[34] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.\n", "award": [], "sourceid": 1228, "authors": [{"given_name": "Yunchen", "family_name": "Pu", "institution": "Duke University"}, {"given_name": "Zhe", "family_name": "Gan", "institution": "Duke"}, {"given_name": "Ricardo", "family_name": "Henao", "institution": "Duke University"}, {"given_name": "Xin", "family_name": "Yuan", "institution": "Bell Labs"}, {"given_name": "Chunyuan", "family_name": "Li", "institution": "Duke"}, {"given_name": "Andrew", "family_name": "Stevens", "institution": "Duke University"}, {"given_name": "Lawrence", "family_name": "Carin", "institution": "Duke University"}]}