{"title": "Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles", "book": "Advances in Neural Information Processing Systems", "page_first": 2119, "page_last": 2127, "abstract": "Many practical perception systems exist within larger processes which often include interactions with users or additional components that are capable of evaluating the quality of predicted solutions. In these contexts, it is beneficial to provide these oracle mechanisms with multiple highly likely hypotheses rather than a single prediction. In this work, we pose the task of producing multiple outputs as a learning problem over an ensemble of deep networks -- introducing a novel stochastic gradient descent based approach to minimize the loss with respect to an oracle. Our method is simple to implement, agnostic to both architecture and loss function, and parameter-free. Our approach achieves lower oracle error compared to existing methods on a wide range of tasks and deep architectures. We also show qualitatively that solutions produced from our approach often provide interpretable representations of task ambiguity.", "full_text": "Stochastic Multiple Choice Learning for\n\nTraining Diverse Deep Ensembles\n\nStefan Lee\nVirginia Tech\nste\ufb02ee@vt.edu\n\nSenthil Purushwalkam\n\nCarnegie Mellon University\nspurushw@andrew.cmu.edu\n\nMichael Cogswell\n\nVirginia Tech\n\ncogswell@vt.edu\n\nViresh Ranjan\nVirginia Tech\nrviresh@vt.edu\n\nDavid Crandall\nIndiana University\ndjcran@indiana.edu\n\nDhruv Batra\nVirginia Tech\ndbatra@vt.edu\n\nAbstract\n\nMany practical perception systems exist within larger processes that include inter-\nactions with users or additional components capable of evaluating the quality of\npredicted solutions. In these contexts, it is bene\ufb01cial to provide these oracle mecha-\nnisms with multiple highly likely hypotheses rather than a single prediction. 
In this work, we pose the task of producing multiple outputs as a learning problem over an ensemble of deep networks – introducing a novel stochastic gradient descent based approach to minimize the loss with respect to an oracle. Our method is simple to implement, agnostic to both architecture and loss function, and parameter-free. Our approach achieves lower oracle error compared to existing methods on a wide range of tasks and deep architectures. We also show qualitatively that the diverse solutions produced often provide interpretable representations of task ambiguity.

1 Introduction

Perception problems rarely exist in a vacuum. Typically, problems in Computer Vision, Natural Language Processing, and other AI subfields are embedded in larger applications and contexts. For instance, the task of recognizing and segmenting objects in an image (semantic segmentation [6]) might be embedded in an autonomous vehicle [7], while the task of describing an image with a sentence (image captioning [18]) might be part of a system to assist visually-impaired users [22, 30]. In these scenarios, the goal of perception is often not to generate a single output but a set of plausible hypotheses for a ‘downstream’ process, such as a verification component or a human operator. These downstream mechanisms may be abstracted as oracles that have the capability to pick the correct solution from this set. Such a learning setting is called Multiple Choice Learning (MCL) [8], where the goal for the learner is to minimize the oracle loss achieved by a set of M solutions. More formally, given a dataset of input-output pairs {(x_i, y_i) | x_i ∈ X, y_i ∈ Y}, the goal of classical supervised learning is to search for a mapping f : X → Y that minimizes a task-dependent loss ℓ : Y × Y → R+ capturing the error between the actual labeling y_i and the predicted labeling ŷ_i.
In this setting, the learned function f makes a single prediction for each input and pays a penalty for that prediction. In contrast, Multiple Choice Learning seeks to learn a mapping g : X → Y^M that produces M solutions Ŷ_i = (ŷ_i^1, . . . , ŷ_i^M) such that the oracle loss min_m ℓ(y_i, ŷ_i^m) is minimized.

In this work, we fix the form of this mapping g to be the union of outputs from an ensemble of predictors such that g(x) = {f_1(x), f_2(x), . . . , f_M(x)}, and address the task of training ensemble members f_1, . . . , f_M such that g minimizes the oracle loss. Under our formulation, different ensemble members are free to specialize on subsets of the data distribution, so that collectively they produce a set of outputs which covers the space of high-probability predictions well.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Single-prediction based models often produce solutions with low expected loss in the face of ambiguity; however, these solutions are often unrealistic or do not reflect the image content well (row 1). Instead, we train ensembles under a unified loss which allows each member to produce different outputs reflecting multi-modal beliefs (row 2). We evaluate our method on image classification, segmentation, and captioning tasks.

Diverse solution sets are especially useful for structured prediction problems with multiple reasonable interpretations, only one of which is correct. Situations that often arise in practical systems include:

– Implicit class confusion. The label space of many classification problems is often an arbitrary quantization of a continuous space. For example, a vision system may be expected to classify between tables and desks, despite many real-world objects arguably belonging to both classes.
By\nmaking multiple predictions, this implicit confusion can be viewed explicitly in system outputs.\n\n\u2013 Ambiguous evidence. Often there is simply not enough information to make a de\ufb01nitive prediction.\nFor example, even a human expert may not be able to identify a \ufb01ne-grained class (e.g., particular\nbreed of dog) given an occluded or distant view, but they likely can produce a small set of reasonable\nguesses. In such cases, the task of producing a diverse set of possibilities is more clearly de\ufb01ned\nthan producing one correct answer.\n\n\u2013 Bias towards the mode. Many models have a tendency to exhibit mode-seeking behaviors as a\nway to reduce expected loss over a dataset (e.g., a conversation model frequently producing \u2018I\ndon\u2019t know\u2019). By making multiple predictions, a system can improve coverage of lower density\nareas of the solution space, without sacri\ufb01cing performance on the majority of examples.\n\nIn other words, by optimizing for the oracle loss, a multiple-prediction learner can respond to\nambiguity much like a human does, by making multiple guesses that capture multi-modal beliefs.\nIn contrast, a single-prediction learner is forced to produce a solution with low expected loss in\nthe face of ambiguity. Figure 1 illustrates how this can produce solutions that are not useful in\npractice. In semantic segmentation, for example, this problem often causes objects to be predicted\nas a mixture of multiple classes (like the horse-cow shown in the \ufb01gure). In image captioning,\nminimizing expected loss encourages generic sentences that are \u2018safe\u2019 with respect to expected error\nbut not very informative. 
For example, Figure 1 shows two pairs of images each having different\nimage content but very similar, generic captions \u2013 the model knows it is safe to assume that birds are\non branches and that cakes are eaten with forks.\nIn this paper, we generalize the Multiple Choice Learning paradigm [8, 9] to jointly learn ensembles\nof deep networks that minimize the oracle loss directly. We are the \ufb01rst to formalize these ideas in\nthe context of deep networks and we present a novel training algorithm that avoids costly retraining\n[8] of past methods. Our primary technical contribution is the formulation of a stochastic block\ngradient descent optimization approach well-suited to minimizing the oracle loss in ensembles of\ndeep networks, which we call Stochastic Multiple Choice Learning (sMCL). Our formulation is\napplicable to any model trained with stochastic gradient descent, is agnostic to the form of the task\ndependent loss, is parameter-free, and is time ef\ufb01cient, training all ensemble members concurrently.\nWe demonstrate the broad applicability and ef\ufb01cacy of sMCL for training diverse deep ensembles\nwith interpretable emergent expertise on a wide range of problem domains and network architectures,\nincluding Convolutional Neural Network (CNN) [1] ensembles for image classi\ufb01cation [17], Fully-\nConvolutional Network (FCN) [20] ensembles for semantic segmentation [6], and combined CNN\nand Recurrent Neural Network (RNN) ensembles [14] for image captioning [18]. We provide detailed\nanalysis of the training and output behaviors of the resulting ensembles, demonstrating how ensemble\nmember specialization and expertise emerge automatically when trained using sMCL. 
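Concretely, the oracle loss charges each example only for the best of the M ensemble outputs. The following minimal numpy sketch (our illustration with made-up numbers, not code from the paper) makes the definition explicit:

```python
import numpy as np

def oracle_loss(member_losses):
    """Oracle loss of an ensemble: each example is charged only for the
    best (minimum-loss) of the M member predictions, then summed.

    member_losses: array of shape (M, N); member_losses[m, i] is the
    task loss of ensemble member m on example i.
    """
    return float(np.min(member_losses, axis=0).sum())

# Toy numbers: two members that specialize on different examples achieve
# a lower oracle loss than either member achieves on its own.
losses = np.array([[0.1, 0.9, 0.2],
                   [0.8, 0.2, 0.7]])
print(oracle_loss(losses))  # 0.5 = 0.1 + 0.2 + 0.2
```

The per-example min is what lets members specialize: a member pays nothing on examples another member already handles well.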
Our method outperforms existing baselines and produces sets of outputs with high oracle performance.

2 Related Work

Ensemble Learning. Much of the existing work on training ensembles focuses on diversity between member models as a means to improve performance by decreasing error correlation. This is often accomplished by resampling existing training data for each member model [27] or by producing artificial data that encourages new models to be decorrelated with the existing ensemble [21]. Other approaches train or combine ensemble members under a joint loss [19, 26]. More recently, the work of Hinton et al. [12] and Ahmed et al. [2] explores using ‘generalist’ network performance statistics to inform the design of ensemble-of-expert architectures for classification. In contrast, sMCL discovers specialization as a consequence of minimizing the oracle loss. Importantly, most existing methods do not generalize to structured output labels, while sMCL seamlessly adapts, discovering different task-dependent specializations automatically.

Generating Multiple Solutions. There is a large body of work on the topic of extracting multiple diverse solutions from a single model [3, 15, 16, 23, 24]; however, these approaches are designed for probabilistic structured-output models and are not directly applicable to general deep architectures. Most related to our approach is the work of Guzman-Rivera et al.
[8, 9] which explicitly minimizes\noracle loss over the outputs of an ensemble, formalizing this setting as the Multiple Choice Learning\n(MCL) paradigm. They introduce a general alternating block coordinate descent training approach\nwhich requires retraining models multiple times. Vondrick et al. [29] follow a similar methodology to\ntrain multi-modal regressors to predict the feature representations of future frames in video.\nRecently, Dey et al. [5] reformulated the problem of generating multiple diverse solutions as a\nsubmodular optimization task in which ensemble members are learned sequentially in a boosting-like\nmanner to maximize marginal gain in oracle performance. Both these methods require either costly\nretraining or sequential training, making them poorly suited to modern deep architectures that can\ntake weeks to train. To address this serious shortcoming and to provide the \ufb01rst practical algorithm for\ntraining diverse deep ensembles, we introduce a stochastic gradient descent (SGD) based algorithm\nto train ensemble members concurrently.\n\n3 Multiple-Choice Learning as Stochastic Block Gradient Descent\n\nWe consider the task of training an ensemble of differentiable learners that together produce a set of\nsolutions with minimal loss with respect to an oracle that selects only the lowest-error prediction.\nNotation. We use [n] to denote the set {1, 2, . . . , n}. Given a training set of input-output pairs\nD = {(xi, yi) | xi \u2208 X , yi \u2208 Y}, our goal is to learn a function g : X \u2192 Y M which maps\neach input to M outputs. We \ufb01x the form of g to be an ensemble of M learners f such that\ng(x) = (f1(x), . . . , fM (x)). 
For some task-dependent loss ℓ(y, ŷ), which measures the error between true and predicted outputs y and ŷ, we define the oracle loss of g over the dataset D as

L_O(D) = Σ_{i=1}^{n} min_{m∈[M]} ℓ(y_i, f_m(x_i)).

Minimizing Oracle Loss with Multiple Choice Learning. In order to directly minimize the oracle loss for an ensemble of learners, Guzman-Rivera et al. [8] present an objective which forms a (potentially tight) upper bound. This objective replaces the min in the oracle loss with indicator variables (p_{i,m})_{m=1}^{M}, where p_{i,m} is 1 if predictor m has the lowest error on example i,

argmin_{f_m, p_{i,m}}  Σ_{i=1}^{n} Σ_{m=1}^{M} p_{i,m} ℓ(y_i, f_m(x_i))   s.t.   Σ_{m=1}^{M} p_{i,m} = 1,  p_{i,m} ∈ {0, 1}.   (1)

The resulting minimization is a constrained joint optimization over ensemble parameters and data-point assignments. The authors propose an alternating block algorithm, shown in Algorithm 1, to approximately minimize this objective. Similar to K-Means or ‘hard-EM,’ this approach alternates between assigning examples to their min-loss predictors and training models to convergence on the partition of examples assigned to them. Note that this approach is not feasible for training deep networks, since modern architectures [11] can take weeks or months to train a single model once.

Figure 2: The MCL approach of [8] (Alg. 1) requires costly retraining while our sMCL method (Alg. 2) works within standard SGD solvers, training all ensemble members under a joint loss.

Stochastic Multiple Choice Learning. To overcome this shortcoming, we propose a stochastic algorithm for differentiable learners which interleaves the assignment step with batch updates in stochastic gradient descent. Consider the partial derivative of the objective in Eq.
1 with respect to the output of the m-th individual learner on example x_i,

∂L_O / ∂f_m(x_i) = p_{i,m} · ∂ℓ(y_i, f_m(x_i)) / ∂f_m(x_i).   (2)

Notice that if f_m is the minimum-error predictor for example x_i, then p_{i,m} = 1 and the gradient term is the same as if training a single model; otherwise, the gradient is zero. This behavior lends itself to a straightforward optimization strategy for learners trained by SGD-based solvers. For each batch, we pass the examples through the learners, calculating losses from each ensemble member for each example. During the backward pass, the gradient of the loss for each example is backpropagated only to the lowest-error predictor on that example (with ties broken arbitrarily).

This approach, which we call Stochastic Multiple Choice Learning (sMCL), is shown in Algorithm 2. sMCL is generalizable to any learner trained by stochastic gradient descent and is thus applicable to an extensive range of modern deep networks. Unlike the iterative training schedule of MCL, sMCL ensembles need only be trained to convergence once, in parallel. sMCL is also agnostic to the exact form of the loss function ℓ, such that it can be applied without additional effort to a variety of problems.

4 Experiments

In this section, we present results for sMCL ensembles trained for the tasks and deep architectures shown in Figure 3. These include CNN ensembles for image classification, FCN ensembles for semantic segmentation, and CNN+RNN ensembles for image caption generation.

Baselines. Many existing general techniques for inducing diversity are not directly applicable to deep networks. We compare our proposed method against:

- Classical ensembles in which each model is trained under an independent loss with differing random initializations. We will refer to these as Indp.
ensembles in figures.

- MCL [8], which alternates between training models to convergence on assigned examples and allocating examples to their lowest-error model. We repeat this process for 5 meta-iterations and initialize ensembles with (different) random weights. We find MCL performs similarly to sMCL on small classification tasks; however, MCL performance drops substantially on segmentation and captioning tasks. Unlike sMCL, which can effectively reassign an example once per epoch, MCL does so only after convergence, limiting its capacity to specialize compared to sMCL. We also note that sMCL is 5x faster than MCL, where the factor of 5 is the result of choosing 5 meta-iterations (other applications may require more, further increasing the gap).

- Dey et al. [5] train models sequentially in a boosting-like fashion, each time reweighting examples to maximize the marginal increase of the evaluation metric. We find these models saturate quickly as the ensemble size grows. As performance increases, the marginal gain and therefore the weights approach zero. With low weights, the average gradient backpropagated for stochastic learners drops substantially, reducing the rate and effectiveness of learning without careful tuning. To compute weights, [5] requires an error measure bounded above by 1: accuracy (for classification) and IoU (for segmentation) satisfy this; the CIDEr-D score [28] divided by 10 guarantees this for captioning.

Figure 3: We experiment with three problem domains using the various architectures shown above: (a) the convolutional classification model of [1] for CIFAR10 [17], (b) the fully-convolutional segmentation model of Long et al. [20], and (c) the CNN+RNN based captioning model of Karpathy et al. [14].

Oracle Evaluation. We present results as oracle versions of the task-dependent performance metrics. These oracle metrics report the highest score over all outputs for a given input.
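The generic recipe is the same for every task: score each of the M outputs against the ground truth with the task metric, keep the best, and average over the dataset. A minimal sketch (a hypothetical helper, not from the paper):

```python
def oracle_metric(score_fn, ensemble_outputs, targets):
    """Generic oracle metric: for each input, report the best score any
    ensemble member achieves, then average over the dataset.

    ensemble_outputs[i] holds the M predictions for example i;
    score_fn(y, y_hat) is the task metric (accuracy, IoU, CIDEr-D, ...).
    """
    best = [max(score_fn(y, y_hat) for y_hat in preds)
            for preds, y in zip(ensemble_outputs, targets)]
    return sum(best) / len(best)

# Oracle accuracy with M = 2 guesses per example: two of the three
# examples have the correct label among the guesses.
acc = oracle_metric(lambda y, y_hat: float(y == y_hat),
                    [[0, 3], [2, 1], [4, 4]], [3, 5, 4])
print(acc)
```

Swapping in an IoU or CIDEr-D `score_fn` gives the oracle IoU and oracle CIDEr-D used in the experiments below.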
For example, in classification tasks, oracle accuracy is exactly the top-k criterion of ImageNet [25], i.e., whether at least one of the outputs is the correct label. Likewise, oracle intersection over union (IoU) is the highest IoU between the ground-truth segmentation and any one of the outputs. Oracle metrics allow the evaluation of multiple-prediction systems separately from downstream re-ranking or selection systems, and have been used extensively in previous work [3, 5, 8, 9, 15, 16, 23, 24].

Our experiments convincingly demonstrate the broad applicability and efficacy of sMCL for training diverse deep ensembles. In all three experiments, sMCL significantly outperforms classical ensembles, Dey et al. [5] (typical improvements of 6-10%), and MCL (while providing a 5x speedup over MCL). Our analysis shows that the exact same algorithm (sMCL) leads to the automatic emergence of different interpretable notions of specialization among ensemble members.

4.1 Image Classification

Model. We begin our experiments with sMCL on the CIFAR10 [17] dataset using the small convolutional neural network "CIFAR10-Quick" provided with the Caffe deep learning framework [13]. CIFAR10 is a ten-way classification task with small 32×32 images. For these experiments, the reference model is trained using a batch size of 350 for 5,000 iterations with a momentum of 0.9, weight decay of 0.004, and an initial learning rate of 0.001 which drops to 0.0001 after 4,000 iterations.

Results. Oracle accuracy for sMCL and baseline ensembles of sizes 1 to 6 is shown in Figure 4a. The sMCL trained ensembles result in higher oracle accuracy than the baseline methods, and are comparable to MCL while being 5x faster. The method of Dey et al. [5] performs worse than independent ensembles as ensemble size grows. Figure 4b shows the oracle loss during training for sMCL and regular ensembles.
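The mechanics behind these training curves are easy to reproduce in miniature. The toy sketch below (our illustration, not the authors' Caffe implementation) applies the winner-take-gradient rule of Section 3 to two scalar least-squares regressors on a deliberately ambiguous task: the same input is labeled +1 or -1 at random. A single model would settle on the uninformative mean; under sMCL-style updates, each member claims one mode and the oracle loss approaches zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ambiguous task: the input x = 1.0 is labeled +1 or -1 at random.
M, lr = 2, 0.1
w = rng.normal(size=M) * 0.1            # scalar members: f_m(x) = w[m] * x

for _ in range(200):
    x, y = 1.0, float(rng.choice([-1.0, 1.0]))
    losses = (y - w * x) ** 2           # squared loss of every member
    m = int(np.argmin(losses))          # oracle / 'winning' member
    # Winner-take-gradient: only the min-loss member receives an update.
    w[m] += lr * 2.0 * (y - w[m] * x) * x

# Each member has specialized on one mode of the label distribution.
print(sorted(round(float(v), 2) for v in w))  # [-1.0, 1.0]
```

In a deep ensemble the only change is that the per-example gradient is routed through the winning network's backward pass instead of a scalar update.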
The sMCL trained models optimize the oracle cross-entropy loss directly, not only arriving at lower-loss solutions but also reducing error more quickly.

Interpretable Expertise: sMCL Induces Label-Space Clustering. Figure 4c shows the class-wise distribution of the assignment of test datapoints to the oracle or ‘winning’ predictor for an M = 4 sMCL ensemble. The level of class division is striking – most predictors become specialists for certain classes. Note that these divisions emerge from training under the oracle loss and are not hand-designed or pre-initialized in any way. In contrast, Figure 4f shows that the oracle assignments for a standard ensemble are nearly uniform. To explore the space between these two extremes, we loosen the constraints of Eq. 1 such that the k lowest-error predictors are penalized. By varying k between 1 and the number of ensemble members M, the models transition from minimizing the oracle loss at k = 1 to a traditional ensemble at k = M. Figures 4d and 4e show these results. We find a direct correlation between the degree of specialization and oracle accuracy, with k = 1 netting the highest oracle accuracy.

4.2 Semantic Segmentation

We now present our results for the semantic segmentation task on the PASCAL VOC dataset [6].

Model. We use the fully convolutional network (FCN) architecture presented by Long et al. [20] as our base model. Like [20], we train on the PASCAL VOC 2011 training set augmented with the extra segmentations provided in [10], and we test on a subset of the VOC 2011 validation set.
We initialize our sMCL models from a standard ensemble trained for 50 epochs at a learning rate of 10^-3. The sMCL ensemble is then fine-tuned for another 15 epochs at a reduced learning rate of 10^-5.

Figure 4: (a) Effect of Ensemble Size; (b) Oracle Loss During Training (M = 4); (c)-(f) oracle assignments for k = 1, 2, 3, and k = M = 4. sMCL trained ensembles produce higher oracle accuracies than baselines (a) by directly optimizing the oracle loss (b). By varying the number of predictors k each example can be assigned to, we can interpolate between sMCL and standard ensembles, and (c-f) show the percentage of test examples of each class assigned to each ensemble member by the oracle for various k. These divisions are not preselected and show how specialization is an emergent property of sMCL training.

Results. Figure 5a shows oracle accuracy (class-averaged IoU) for all methods with ensemble sizes ranging from 1 to 6. Again, sMCL significantly outperforms all baselines (~7% relative improvement over classical ensembles). In this more complex setting, we see the method of Dey et al. [5] saturates more quickly – resulting in performance worse than classical ensembles as ensemble size grows. Though we expect MCL to achieve similar results as sMCL, retraining the MCL ensembles a sufficient number of times proved infeasible, so results after five meta-iterations are shown.

Interpretable Expertise: sMCL as Segmentation Specialists. In Figure 5b, we analyze the class distribution of the predictions using an sMCL ensemble with 4 members. For each test sample, the oracle picks the prediction which corresponds to the ensemble member with the highest accuracy for that sample.
We find the specialization with respect to classes is much less evident than in the classification experiments. As segmentation presents challenges other than simply selecting the correct class, specialization can occur in terms of the shape and frequency of predicted segments in addition to class divisions; however, we do still see some class biases – network 2 captures cows, tables, and sofas well, and network 4 has become an expert on sheep and horses.

Figure 5: a) sMCL trained ensembles consistently result in improved oracle mean IoU over baselines on PASCAL VOC 2011. b) Distribution of examples from each category assigned by the oracle for an sMCL ensemble.

Figure 6: Sample images and corresponding predictions obtained by each member of the sMCL ensemble as well as the top output of a classical ensemble. The output with minimum loss on each example is outlined in red. Notice that sMCL ensembles vary in the shape, class, and frequency of predicted segments.

Figure 6 shows qualitative results from a four-member sMCL ensemble. We can clearly observe the diversity in the segmentations predicted by different members. In the first row, we see the majority of the ensemble members produce dining tables of various completeness in response to the visual uncertainty caused by the clutter. Networks 2 and 3 capture this ambiguity well, producing segmentations with the dining table completely present or absent. Row 2 demonstrates the capacity of sMCL ensembles to provide multiple high-quality solutions. The models are confused whether the animal is a horse or a cow – models 1 and 3 produce typical ‘safe’ responses while models 2 and 4 attempt to give cohesive responses.
Finally, row 3 shows how the models can learn biases about the frequency of segments, with model 3 presenting only the sheep.

4.3 Image Captioning

In this section, we show that sMCL trained ensembles can produce sets of high-quality and diverse sentences, which is essential to improving recall and capturing ambiguities in language and perception.

Model. We adopt the model and training procedure of Karpathy et al. [14], utilizing their publicly available implementation neuraltalk2. The model consists of a VGG16 network [4] which encodes the input image as a fixed-length representation for a Long Short-Term Memory (LSTM) language model. We train and test on the MSCOCO dataset [18], using the same splits as [14]. We perform two experimental setups, either freezing or fine-tuning the CNN. In the first, we freeze the parameters of the CNN and train multiple LSTM models using the CNN as a static feature generator. In the second, we aggregate and back-propagate the gradients from each LSTM model through the CNN in a tree-like model structure. This is largely a construct of memory restrictions, as our hardware could not accommodate multiple VGG16 networks. We train each ensemble for 70k iterations with the parameters of the CNN fixed. For the fine-tuning experiments, we perform another 70k iterations of training to fine-tune the CNN. We generate sentences for testing by performing beam search with a beam width of two (following [14]).

Results. Table 1 presents the oracle CIDEr-D [28] scores for all methods on the validation set. We additionally compare with all outputs of a beam search over a single CNN+LSTM model with beam width ranging from 1 to 5. sMCL significantly outperforms the baseline ensemble learning methods (shown in the upper section of the table), increasing both oracle performance and the number of unique n-grams.
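The unique n-gram count used here is a simple diversity measure over the generated sentences. It can be computed in a few lines (a hypothetical tokenization of our own; the paper does not specify its exact counting script):

```python
def unique_ngrams(captions, n):
    """Count distinct n-grams across a set of generated captions --
    a simple measure of how diverse a multiple-output captioner is."""
    grams = set()
    for caption in captions:
        tokens = caption.lower().rstrip('.').split()
        # Sliding window of n consecutive tokens.
        grams.update(zip(*(tokens[i:] for i in range(n))))
    return len(grams)

diverse = ["A man riding a wave on top of a surfboard.",
           "A person on a surfboard in the water.",
           "A surfer riding a wave in the ocean."]
repeated = ["A man riding a wave on top of a surfboard."] * 3
print(unique_ngrams(diverse, 2), unique_ngrams(repeated, 2))  # 19 9
```

An ensemble that emits near-duplicate sentences adds almost no new n-grams, so the count directly rewards the kind of diversity sMCL is trained for.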
For M = 5, beam search from a single model achieves a greater oracle score but produces significantly fewer unique n-grams. We note that beam search is an inference method, and an increased beam width could provide similar benefits for sMCL ensembles.

                         | Oracle CIDEr-D for Ensemble of Size | # Unique n-Grams (M=5)      | Avg.
                         | M=1    2      3      4      5       | n=1   2     3      4        | Length
sMCL                     | -      0.822  0.862  0.911  0.922   | 713   2902  6464   15427    | 10.21
MCL [8]                  | -      0.752  0.810  0.823  0.852   | 384   1565  3586   9551     | 9.87
Dey [5]                  | -      0.798  0.850  0.887  0.910   | 584   2266  4969   12208    | 10.26
Indp.                    | 0.684  0.757  0.784  0.809  0.831   | 540   2003  4312   10297    | 10.24
sMCL (fine-tuned CNN)    | -      1.064  1.130  1.179  1.184   | 1135  6028  15184  35518    | 10.43
Indp. (fine-tuned CNN)   | 0.912  1.001  1.050  1.073  1.095   | 921   4335  10534  23811    | 10.33
Beam Search              | 0.654  0.754  0.833  0.888  0.943   | 580   2272  4920   12920    | 10.62

Table 1: sMCL based methods outperform other ensemble methods at captioning, improving both oracle performance and the number of distinct n-grams.
For low M, sMCL also performs better than multiple-output decoders.

Image 1 – Independently trained networks: "A man riding a wave on top of a surfboard." / "A man riding a wave on top of a surfboard." / "A man riding a wave on top of a surfboard." / "A man riding a wave on top of a surfboard." sMCL ensemble: "A man riding a wave on top of a surfboard." / "A person on a surfboard in the water." / "A surfer is riding a wave in the ocean." / "A surfer riding a wave in the ocean."

Image 2 – Independently trained networks: "A group of people standing on a sidewalk." / "A man is standing in the middle of the street." / "A group of people standing around a fire hydrant." / "A group of people standing around a fire hydrant" sMCL ensemble: "A man is walking down the street with an umbrell." / "A group of people sitting at a table with umbrellas." / "A group of people standing around a large plane." / "A group of people standing in front of a building"

Image 3 – Independently trained networks: "A kitchen with a stove and a microwave." / "A white refrigerator freezer sitting inside of a kitchen." / "A white refrigerator sitting next to a window." / "A white refrigerator freezer sitting in a kitchen" sMCL ensemble: "A cat sitting on a chair in a living room." / "A kitchen with a stove and a sink." / "A cat is sitting on top of a refrigerator." / "A cat sitting on top of a wooden table"

Image 4 – Independently trained networks: "A bird is sitting on a tree branch." / "A bird is perched on a branch in a tree." / "A bird is perched on a branch in a tree." / "A bird is sitting on a tree branch" sMCL ensemble: "A small bird perched on top of a tree branch." / "A couple of birds that are standing in the grass." / "A bird perched on top of a branch." / "A bird perched on a tree branch in the sky"

Figure 7: Comparison of sentences generated by members of a standard independently trained ensemble and an sMCL based ensemble of size four.

Interpretable Expertise: sMCL as N-Gram Specialists. Figure 7 shows example images and generated captions from standard and sMCL ensembles of size four (results from beam search over a single model are similar).
It is evident that the independently trained models tend to predict similar sentences regardless of initialization, perhaps owing to the highly structured nature of the output space and the mode bias of the underlying language model. The sMCL based ensemble, on the other hand, generates diverse sentences that capture ambiguity both in language and perception. The first row shows an extreme case in which all members of the standard ensemble predict identical sentences; in contrast, the sMCL ensemble produces sentences that describe the scene with many different structures. In row three, both models are confused about the content of the image, mistaking the pile of suitcases for kitchen appliances. However, the sMCL ensemble widens the scope of some sentences to include the cat clearly depicted in the image. The fourth row is an example of regression towards the mode, with the standard model producing multiple similar sentences describing birds on branches. The sMCL ensemble shows the same tendency; however, one of its members breaks away and captures the true content of the image.

5 Conclusion

To summarize, we propose Stochastic Multiple Choice Learning (sMCL), an SGD-based technique for training diverse deep ensembles that follows a 'winner-take-gradient' training strategy. Our experiments demonstrate the broad applicability and efficacy of this approach. In all experimental settings, sMCL significantly outperforms classical ensembles and other strong baselines, including the 5x slower MCL procedure. Our analysis shows that exactly the same algorithm (sMCL) automatically generates specializations among ensemble members along different task-specific dimensions.
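The 'winner-take-gradient' strategy can be illustrated with a deliberately simplified sketch: an ensemble of M = 2 one-parameter linear models (rather than deep networks) where, on each example, only the member with the lowest loss receives a gradient update. The setup and names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble of M = 2 linear regressors y = w * x.
M, lr = 2, 0.1
w = rng.normal(size=M)

def smcl_step(x, y, w):
    preds = w * x                       # each member's prediction
    losses = (preds - y) ** 2           # per-member squared error
    winner = int(np.argmin(losses))     # oracle selects the best member
    grad = 2 * (preds[winner] - y) * x  # gradient flows only to the winner
    w[winner] -= lr * grad
    return winner

# Data drawn from two modes (slope +1 and slope -1); winner-take-gradient
# training should let each member specialize on one mode.
for _ in range(2000):
    x = rng.uniform(0.5, 1.5)
    slope = rng.choice([1.0, -1.0])
    smcl_step(x, slope * x, w)

print(np.sort(w))  # roughly [-1, 1]: one specialist per mode
```

A standard ensemble, which updates every member on every example, would instead drive both weights toward the mean slope of 0, mirroring the regression-to-the-mode behavior seen in Figure 7.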
sMCL is simple to implement, agnostic to both architecture and loss function, and parameter-free; it requires only the introduction of a single new sMCL layer into existing ensemble architectures.

Acknowledgments
This work was supported in part by a National Science Foundation CAREER award, an Army Research Office YIP award, ICTAS Junior Faculty award, Office of Naval Research grant N00014-14-1-0679, Google Faculty Research award, AWS in Education Research grant, and NVIDIA GPU donation, all awarded to DB, and by an NSF CAREER award (IIS-1253549), the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory contract FA8650-12-C-7212, a Google Faculty Research award, and an NVIDIA GPU donation, all awarded to DC. Computing resources used by this work are supported in part by NSF (ACI-0910812 and CNS-0521433), the Lily Endowment, Inc., and the Indiana METACyt Initiative. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, NSF, or the U.S. Government.

References
[1] CIFAR-10 Quick Network Tutorial. http://caffe.berkeleyvision.org/gathered/examples/cifar10.html, 2016.
[2] K. Ahmed, M. H. Baig, and L. Torresani. Network of experts for large-scale image categorization. arXiv preprint arXiv:1604.06119, 2016.
[3] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-Best Solutions in Markov Random Fields. In Proceedings of European Conference on Computer Vision (ECCV), 2012.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[5] D.
Dey, V. Ramakrishna, M. Hebert, and J. Andrew Bagnell. Predicting multiple structured visual interpretations. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR), 2013.
[8] A. Guzman-Rivera, D. Batra, and P. Kohli. Multiple Choice Learning: Learning to Produce Multiple Structured Outputs. In Advances in Neural Information Processing Systems (NIPS), 2012.
[9] A. Guzman-Rivera, P. Kohli, D. Batra, and R. Rutenbar. Efficiently enforcing diversity in multi-output structured prediction. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.
[10] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2011.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[12] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In Advances in Neural Information Processing Systems (NIPS) - Deep Learning Workshop, 2014.
[13] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[14] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[15] A. Kirillov, B. Savchynskyy, D. Schlesinger, D. Vetrov, and C. Rother.
Inferring m-best diverse solutions in a single one. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.
[16] A. Kirillov, D. Schlesinger, D. Vetrov, C. Rother, and B. Savchynskyy. M-best-diverse labelings for submodular energies and beyond. In Advances in Neural Information Processing Systems (NIPS), 2015.
[17] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context, 2014.
[19] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399–1404, 1999.
[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[21] P. Melville and R. J. Mooney. Creating diversity in ensembles using artificial data. Information Fusion, 6(1):99–111, 2005.
[22] Microsoft. Decades of computer vision research, one 'Swiss Army knife'. blogs.microsoft.com/next/2016/03/30/decades-of-computer-vision-research-one-swiss-army-knife/, 2016.
[23] D. Park and D. Ramanan. N-best maximal decoders for part models. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 2627–2634, 2011.
[24] A. Prasad, S. Jegelka, and D. Batra. Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[25] O. Russakovsky, J. Deng, J. Krause, A. Berg, and L. Fei-Fei. The ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012). http://www.image-net.org/challenges/LSVRC/2012/.
[26] A. Strehl and J. Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
The Journal of Machine Learning Research, 3:583–617, 2003.
[27] K. Tumer and J. Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385–404, 1996.
[28] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[29] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 98–106, 2016.
[30] WIRED. Facebook's AI Can Caption Photos for the Blind on Its Own. wired.com/2015/10/facebook-artificial-intelligence-describes-photo-captions-for-blind-people/, 2015.