{"title": "Multi-Task Zipping via Layer-wise Neuron Sharing", "book": "Advances in Neural Information Processing Systems", "page_first": 6016, "page_last": 6026, "abstract": "Future mobile devices are anticipated to perceive, understand and react to the world on their own by running multiple correlated deep neural networks on-device. Yet the complexity of these neural networks needs to be trimmed down both within-model and cross-model to fit in mobile storage and memory. Previous studies focus on squeezing the redundancy within a single neural network. In this work, we aim to reduce the redundancy across multiple models. We propose Multi-Task Zipping (MTZ), a framework to automatically merge correlated, pre-trained deep neural networks for cross-model compression. Central in MTZ is a layer-wise neuron sharing and incoming weight updating scheme that induces a minimal change in the error function. MTZ inherits information from each model and demands light retraining to re-boost the accuracy of individual tasks. Evaluations show that MTZ is able to fully merge the hidden layers of two VGG-16 networks with a 3.18% increase in the test error averaged on ImageNet and CelebA, or share 39.61% parameters between the two networks with <0.5% increase in the test errors for both tasks. The number of iterations to retrain the combined network is at least 17.8 times lower than that of training a single VGG-16 network. Moreover, experiments show that MTZ is also able to effectively merge multiple residual networks.", "full_text": "Multi-Task Zipping via Layer-wise Neuron Sharing\n\nXiaoxi He\nETH Zurich\nhex@ethz.ch\n\nZimu Zhou\u2217\nETH Zurich\n\nzzhou@tik.ee.ethz.ch\n\nLothar Thiele\nETH Zurich\n\nthiele@ethz.ch\n\nAbstract\n\nFuture mobile devices are anticipated to perceive, understand and react to the world\non their own by running multiple correlated deep neural networks on-device. 
Yet\nthe complexity of these neural networks needs to be trimmed down both within-\nmodel and cross-model to \ufb01t in mobile storage and memory. Previous studies\nsqueeze the redundancy within a single model. In this work, we aim to reduce the\nredundancy across multiple models. We propose Multi-Task Zipping (MTZ), a\nframework to automatically merge correlated, pre-trained deep neural networks\nfor cross-model compression. Central in MTZ is a layer-wise neuron sharing and\nincoming weight updating scheme that induces a minimal change in the error\nfunction. MTZ inherits information from each model and demands light retraining\nto re-boost the accuracy of individual tasks. Evaluations show that MTZ is able\nto fully merge the hidden layers of two VGG-16 networks with a 3.18% increase\nin the test error averaged on ImageNet and CelebA, or share 39.61% parameters\nbetween the two networks with < 0.5% increase in the test errors for both tasks.\nThe number of iterations to retrain the combined network is at least 17.8\u00d7 lower\nthan that of training a single VGG-16 network. Moreover, experiments show that\nMTZ is also able to effectively merge multiple residual networks.\n\n1\n\nIntroduction\n\nAI-powered mobile applications increasingly demand multiple deep neural networks for correlated\ntasks to be performed continuously and concurrently on resource-constrained devices such as wear-\nables, smartphones, self-driving cars, and drones [5, 18]. While many pre-trained models for different\ntasks are available [14, 23, 25], it is often infeasible to deploy them directly on mobile devices. For\ninstance, VGG-16 models for object detection [25] and facial attribute classi\ufb01cation [17] both contain\nover 130M parameters. 
Packing multiple such models easily strains mobile storage and memory.\nSharing information among tasks holds potential to reduce the sizes of multiple correlated models\nwithout incurring drop in individual task inference accuracy.\nWe study information sharing in the context of cross-model compression, which seeks effective and\nef\ufb01cient information sharing mechanisms among pre-trained models for multiple tasks to reduce the\nsize of the combined model without accuracy loss in each task. A solution to cross-model compression\nis multi-task learning (MTL), a paradigm that jointly learns multiple tasks to improve the robustness\nand generalization of tasks [1, 5]. However, most MTL studies use heuristically con\ufb01gured shared\nstructures, which may lead to dramatic accuracy loss due to improper sharing of knowledge [31].\nSome recent proposals [17, 19, 28] automatically decide \u201cwhat to share\u201d in deep neural networks.\nYet deep MTL usually involves enormous training overhead [31]. Hence it is inef\ufb01cient to ignore the\nalready trained parameters in each model and apply MTL for cross-model compression.\nWe propose Multi-Task Zipping (MTZ), a framework to automatically and adaptively merge correlated,\nwell-trained deep neural networks for cross-model compression via neuron sharing. It decides the\n\n\u2217\n\nCorresponding Author: Zimu Zhou.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\foptimal sharable pairs of neurons on a layer basis and adjusts their incoming weights such that\nminimal errors are introduced in each task. Unlike MTL, MTZ inherits the parameters of each\nmodel and optimizes the information to be shared among models such that only light retraining\nis necessary to resume the accuracy of individual tasks. In effect, it squeezes the inter-network\nredundancy from multiple already trained deep neural networks. 
With appropriate hardware support,\nMTZ can be further integrated with existing proposals for single-model compression, which reduce\nthe intra-network redundancy via pruning [4, 6, 8, 15] or quantization [2, 7].\nThe contributions and results of this work are as follows.\n\n\u2022 We propose MTZ, a framework that automatically merges multiple correlated, pre-trained\ndeep neural networks. It squeezes the task relatedness across models via layer-wise neuron\nsharing, while requiring light retraining to re-boost the accuracy of the combined model.\n\u2022 Experiments show that MTZ is able to merge all the hidden layers of two LeNet net-\nworks [14] (differently trained on MNIST) without increase in test errors. MTZ manages to\nshare 39.61% parameters between the two VGG-16 networks pre-trained for object detection\n(on ImageNet [24]) and facial attribute classi\ufb01cation (on CelebA [16]), while incurring less\nthan 0.5% increase in test errors. Even when all the hidden layers are fully merged, there is\na moderate (averaged 3.18%) increase in test errors for both tasks. MTZ achieves the above\nperformance with at least 17.9\u00d7 fewer iterations than training a single VGG-16 network\nfrom scratch [25]. In addition, MTZ is able to share 90% of the parameters among \ufb01ve\nResNets on \ufb01ve different visual recognition tasks while inducing negligible loss on accuracy.\n\n2 Related Work\n\nMulti-task Learning. Multi-task learning (MTL) leverages the task relatedness in the form of\nshared structures to jointly learn multiple tasks [1]. Our MTZ resembles MTL in effect, i.e., sharing\nstructures among related tasks, but differs in objectives. MTL jointly trains multiple tasks to improve\ntheir generalization, while MTZ aims to compress multiple already trained tasks with mild training\noverhead. Georgiev et al. 
[5] are the first to apply MTL in the context of multi-model compression. However, as in most MTL studies, the shared topology is heuristically configured, which may lead to improper knowledge transfer [29]. Only a few schemes optimize what to share among tasks, especially for deep neural networks. Yang et al. propose to learn a cross-task sharing structure at each layer by tensor factorization [28]. Cross-stitching networks [19] learn optimal shared and task-specific representations using cross-stitch units. Lu et al. automatically grow a wide multi-task network architecture from a thin network by branching [17]. Similarly, Rebuffi et al. sequentially add new tasks to a main task using residual adapters for ResNets [21]. Different from the above methods, MTZ inherits the parameters directly from each pre-trained network when optimizing the neurons shared among tasks in each layer, and demands only light retraining.

Single-Model Compression. Deep neural networks are typically over-parameterized [3]. There have been various model compression proposals to reduce the redundancy in a single neural network. Pruning-based methods sparsify a neural network by eliminating unimportant weights (connections) [4, 6, 8, 15]. Other approaches reduce the dimensions of a neural network by neuron trimming [11] or by learning a compact (yet dense) network via knowledge distillation [22, 10]. The memory footprint of a neural network can be further reduced by lowering the precision of parameters [2, 7]. Unlike previous research that deals with the intra-redundancy of a single network, our work reduces the inter-redundancy among multiple networks. In principle, our method is a dimension-reduction-based cross-model compression scheme via neuron sharing. Although previous attempts designed for a single network may apply, they either adopt a heuristic neuron similarity criterion [11] or require training a new network from scratch [22, 10]. 
Our neuron similarity metric is grounded upon parameter sensitivity analysis for neural networks, which is applied in single-model weight pruning [4, 8, 15]. Our work can be integrated with single-model compression to further reduce the size of the combined network.

3 Layer-wise Network Zipping

3.1 Problem Statement

Consider two inference tasks A and B with the corresponding two well-trained models $M^A$ and $M^B$, i.e., trained to a local minimum in error. Our goal is to construct a combined model $M^C$ by sharing as many neurons between layers in $M^A$ and $M^B$ as possible such that (i) $M^C$ has minimal loss in inference accuracy for the two tasks and (ii) the construction of $M^C$ involves minimal retraining.

For ease of presentation, we explain our method with two feed-forward networks of dense fully connected (FC) layers. We extend MTZ to convolutional (CONV) layers in Sec. 3.5, sparse layers in Sec. 3.6 and residual networks (ResNets) in Sec. 3.7. We assume the same input domain and the same number of layers in $M^A$ and $M^B$.

Figure 1: An illustration of layer zipping via neuron sharing: neurons and the corresponding weight matrices (a) before and (b) after zipping the l-th layers of $M^A$ and $M^B$.

3.2 Layer Zipping via Neuron Sharing: Fully Connected Layers

This subsection presents the procedure of zipping the l-th layers ($1 \le l \le L-1$) in $M^A$ and $M^B$ given the previous $(l-1)$ layers have been merged (see Fig. 1). 
We denote the input layers as the 0-th layers. The L-th layers are the output layers of $M^A$ and $M^B$. Denote the weight matrices of the l-th layers in $M^A$ and $M^B$ as $W^A_l \in \mathbb{R}^{N^A_{l-1} \times N^A_l}$ and $W^B_l \in \mathbb{R}^{N^B_{l-1} \times N^B_l}$, where $N^A_l$ and $N^B_l$ are the numbers of neurons in the l-th layers in $M^A$ and $M^B$. Assume $\tilde{N}_{l-1} \in [0, \min\{N^A_{l-1}, N^B_{l-1}\}]$ neurons are shared between the $(l-1)$-th layers in $M^A$ and $M^B$. Hence there are $\hat{N}^A_{l-1} = N^A_{l-1} - \tilde{N}_{l-1}$ and $\hat{N}^B_{l-1} = N^B_{l-1} - \tilde{N}_{l-1}$ task-specific neurons left in the $(l-1)$-th layers in $M^A$ and $M^B$, respectively.

Neuron Sharing. To enforce neuron sharing between the l-th layers in $M^A$ and $M^B$, we calculate the functional difference between the i-th neuron in layer l in $M^A$ and the j-th neuron in the same layer in $M^B$. The functional difference is measured by a metric $d[\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j}]$, where $\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j} \in \mathbb{R}^{\tilde{N}_{l-1}}$ are the incoming weights of the two neurons from the shared neurons in the $(l-1)$-th layer. We do not alter incoming weights from the non-shared neurons in the $(l-1)$-th layer because they are likely to contain task-specific information only.

To zip the l-th layers in $M^A$ and $M^B$, we first calculate the functional difference for each pair of neurons $(i, j)$ in layer l and select $\tilde{N}_l \in [0, \min\{N^A_l, N^B_l\}]$ pairs with the smallest functional difference. These pairs of neurons form a set $\{(i_k, j_k)\}$, where $k = 1, \cdots, \tilde{N}_l$, and each pair is merged into one neuron. 
Thus the neurons in the l-th layers in $M^A$ and $M^B$ fall into three groups: $\tilde{N}_l$ shared, $\hat{N}^A_l = N^A_l - \tilde{N}_l$ specific for A, and $\hat{N}^B_l = N^B_l - \tilde{N}_l$ specific for B.

Weight Matrices Updating. Finally the weight matrices $W^A_l$ and $W^B_l$ are re-organized as follows. The weight vectors $\tilde{w}^A_{l,i_k}$ and $\tilde{w}^B_{l,j_k}$, where $k = 1, \cdots, \tilde{N}_l$, are merged and replaced by a matrix $\tilde{W}_l \in \mathbb{R}^{\tilde{N}_{l-1} \times \tilde{N}_l}$, whose columns are $\tilde{w}_{l,k} = f(\tilde{w}^A_{l,i_k}, \tilde{w}^B_{l,j_k})$, where $f(\cdot)$ is an incoming weight update function. $\tilde{W}_l$ represents the task-relatedness between A and B from layer $(l-1)$ to layer l. The incoming weights from the $N^A_{l-1}$ neurons in layer $(l-1)$ to the $\hat{N}^A_l$ task-specific neurons in layer l in $M^A$ form a matrix $\hat{W}^A_l \in \mathbb{R}^{N^A_{l-1} \times \hat{N}^A_l}$. The remaining columns in $W^A_l$, i.e., the incoming weights of the shared neurons from the $\hat{N}^A_{l-1}$ task-specific neurons in layer $(l-1)$, are packed as $\tilde{W}^A_l \in \mathbb{R}^{\hat{N}^A_{l-1} \times \tilde{N}_l}$. Matrices $\hat{W}^A_l$ and $\tilde{W}^A_l$ contain the task-specific information for A between layer $(l-1)$ and layer l. For task B, we organize matrices $\hat{W}^B_l \in \mathbb{R}^{N^B_{l-1} \times \hat{N}^B_l}$ and $\tilde{W}^B_l \in \mathbb{R}^{\hat{N}^B_{l-1} \times \tilde{N}_l}$ in a similar manner. We also adjust the order of rows in the weight matrices of the $(l+1)$-th layers, $W^A_{l+1}$ and $W^B_{l+1}$, to maintain the correct connections among neurons.

The above layer zipping process can reduce $\tilde{N}_{l-1} \times \tilde{N}_l$ weights from $W^A_l$ and $W^B_l$. 
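The layer zipping step above can be sketched in a few lines of numpy. This is a minimal illustration under our own naming: the pairwise difference matrix `diff` is assumed to be given (the paper's metric $d[\cdot]$ is derived in Sec. 3.3), the greedy non-reusing pair selection is our simplification, and plain averaging stands in for the Hessian-based update $f(\cdot)$.

```python
import numpy as np

def zip_layers(WA, WB, n_shared_prev, n_share, diff):
    """Sketch of zipping one pair of FC layers.

    WA, WB: (N_prev, N) weight matrices of the l-th layers; the first
            n_shared_prev rows are assumed to come from shared neurons.
    diff:   (N_A, N_B) pairwise functional differences d[i, j] (given).
    Returns (W_shared, WA_spec, WB_spec, pairs).
    """
    NA, NB = WA.shape[1], WB.shape[1]
    # Greedily pick the n_share pairs with smallest difference,
    # never reusing a neuron (a stand-in for the paper's selection).
    order = np.dstack(np.unravel_index(np.argsort(diff, axis=None), diff.shape))[0]
    used_a, used_b, pairs = set(), set(), []
    for i, j in order:
        if len(pairs) == n_share:
            break
        if i not in used_a and j not in used_b:
            pairs.append((i, j)); used_a.add(i); used_b.add(j)
    # Merge incoming weights from the shared part of layer l-1;
    # averaging is a placeholder for the Hessian-based f of Sec. 3.3.
    W_shared = np.stack(
        [(WA[:n_shared_prev, i] + WB[:n_shared_prev, j]) / 2 for i, j in pairs],
        axis=1)
    spec_a = [i for i in range(NA) if i not in used_a]
    spec_b = [j for j in range(NB) if j not in used_b]
    return W_shared, WA[:, spec_a], WB[:, spec_b], pairs
```

The returned `pairs` would also drive the row permutation of the $(l+1)$-th weight matrices described above.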
Essential in MTZ are the neuron functional difference metric $d[\cdot]$ and the incoming weight update function $f(\cdot)$. They are designed to demand only light retraining to recover the original accuracy.

3.3 Neuron Functional Difference and Incoming Weight Update

This subsection introduces our neuron functional difference metric $d[\cdot]$ and weight update function $f(\cdot)$, leveraging previous research on parameter sensitivity analysis for neural networks [4, 8, 15].

Preliminaries. A naive approach to assessing the impact of a change in some parameter vector $\theta$ on the objective function (training error) $E$ is to apply the parameter change and re-evaluate the error on the entire training data. An alternative is to exploit second order derivatives [4, 8]. Specifically, the Taylor series of the change $\delta E$ in training error due to a certain parameter vector change $\delta\theta$ is [8]:

$$\delta E = \left(\frac{\partial E}{\partial \theta}\right)^\top \cdot \delta\theta + \frac{1}{2}\,\delta\theta^\top \cdot H \cdot \delta\theta + O(\|\delta\theta\|^3) \quad (1)$$

where $H = \partial^2 E / \partial \theta^2$ is the Hessian matrix containing all the second order derivatives. For a network trained to a local minimum in $E$, the first term vanishes. The third and higher order terms can also be ignored [8]. Hence:

$$\delta E = \frac{1}{2}\,\delta\theta^\top \cdot H \cdot \delta\theta \quad (2)$$

Eq. (2) approximates the deviation in error due to parameter changes. However, it is still a bottleneck to compute and store the Hessian matrix $H$ of a modern deep neural network. Next, we harness the trick in [4] to break the calculation of Hessian matrices into layer-wise pieces, and propose a Hessian-based neuron difference metric as well as the corresponding weight update function for neuron sharing.

Method. 
Inspired by [4], we define the error functions of $M^A$ and $M^B$ in layer l as

$$E^A_l = \frac{1}{n^A} \sum \|\tilde{y}^A_l - y^A_l\|^2 \quad (3)$$

$$E^B_l = \frac{1}{n^B} \sum \|\tilde{y}^B_l - y^B_l\|^2 \quad (4)$$

where $y^A_l$ and $\tilde{y}^A_l$ are the pre-activation outputs of the l-th layer in $M^A$ before and after layer zipping, evaluated on one instance from the training set of A; $y^B_l$ and $\tilde{y}^B_l$ are defined in a similar way; $\|\cdot\|$ is the $l_2$-norm; $n^A$ and $n^B$ are the numbers of training samples for $M^A$ and $M^B$, respectively; $\sum$ is the summation over all training instances. Since $M^A$ and $M^B$ are trained to a local minimum in training error, $E^A_l$ and $E^B_l$ will have the same minimum points as the corresponding training errors.

We further define an error function of the combined network in layer l as

$$E_l = \alpha E^A_l + (1-\alpha) E^B_l \quad (5)$$

where $\alpha \in (0, 1)$ is used to balance the errors of $M^A$ and $M^B$. The change in $E_l$ with respect to neuron sharing in the l-th layer can be expressed in a similar form as Eq. (2):

$$\delta E_l = \frac{1}{2}\,(\delta\tilde{w}^A_{l,i})^\top \cdot \tilde{H}^A_{l,i} \cdot \delta\tilde{w}^A_{l,i} + \frac{1}{2}\,(\delta\tilde{w}^B_{l,j})^\top \cdot \tilde{H}^B_{l,j} \cdot \delta\tilde{w}^B_{l,j} \quad (6)$$

where $\delta\tilde{w}^A_{l,i}$ and $\delta\tilde{w}^B_{l,j}$ are the adjustments in the weights of i and j to merge the two neurons; $\tilde{H}^A_{l,i} = \partial^2 E_l / (\partial\tilde{w}^A_{l,i})^2$ and $\tilde{H}^B_{l,j} = \partial^2 E_l / (\partial\tilde{w}^B_{l,j})^2$ denote the layer-wise Hessian matrices. 
Similarly to [4], the layer-wise Hessian matrices can be calculated as

$$\tilde{H}^A_{l,i} = \frac{\alpha}{n^A} \sum x^A_{l-1} \cdot (x^A_{l-1})^\top \quad (7)$$

$$\tilde{H}^B_{l,j} = \frac{1-\alpha}{n^B} \sum x^B_{l-1} \cdot (x^B_{l-1})^\top \quad (8)$$

where $x^A_{l-1}$ and $x^B_{l-1}$ are the outputs of the merged neurons from layer $(l-1)$ in $M^A$ and $M^B$, respectively.

When sharing the i-th and j-th neurons in the l-th layers of $M^A$ and $M^B$, respectively, our aim is to minimize $\delta E_l$, which can be formulated as the optimization problem below:

$$\min_{(i,j)} \Big\{ \min_{(\delta\tilde{w}^A_{l,i},\,\delta\tilde{w}^B_{l,j})} \delta E_l \Big\} \quad \text{s.t.} \quad \tilde{w}^A_{l,i} + \delta\tilde{w}^A_{l,i} = \tilde{w}^B_{l,j} + \delta\tilde{w}^B_{l,j} \quad (9)$$

Applying the method of Lagrange multipliers, the optimal weight changes and the resulting $\delta E_l$ are:

$$\delta\tilde{w}^{A,opt}_{l,i} = (\tilde{H}^A_{l,i})^{-1} \cdot \big((\tilde{H}^A_{l,i})^{-1} + (\tilde{H}^B_{l,j})^{-1}\big)^{-1} \cdot (\tilde{w}^B_{l,j} - \tilde{w}^A_{l,i}) \quad (10)$$

$$\delta\tilde{w}^{B,opt}_{l,j} = (\tilde{H}^B_{l,j})^{-1} \cdot \big((\tilde{H}^A_{l,i})^{-1} + (\tilde{H}^B_{l,j})^{-1}\big)^{-1} \cdot (\tilde{w}^A_{l,i} - \tilde{w}^B_{l,j}) \quad (11)$$

$$\delta E^{opt}_l = \frac{1}{2}\,(\tilde{w}^A_{l,i} - \tilde{w}^B_{l,j})^\top \cdot \big((\tilde{H}^A_{l,i})^{-1} + (\tilde{H}^B_{l,j})^{-1}\big)^{-1} \cdot (\tilde{w}^A_{l,i} - \tilde{w}^B_{l,j}) \quad (12)$$

Finally, we define the neuron functional difference metric $d[\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j}] = \delta E^{opt}_l$, and the weight update function $f(\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j}) = \tilde{w}^A_{l,i} + \delta\tilde{w}^{A,opt}_{l,i} = \tilde{w}^B_{l,j} + \delta\tilde{w}^{B,opt}_{l,j}$.

3.4 MTZ Framework

Algorithm 1 outlines the process of MTZ on two tasks of the same input domain, e.g., images. We first construct a joint input layer. 
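For concreteness, the closed-form merge of Sec. 3.3, Eqs. (10)-(12), can be sketched in numpy as follows (the helper name is ours, and the layer-wise Hessians are assumed to be given and invertible):

```python
import numpy as np

def optimal_merge(wa, wb, Ha, Hb):
    """Optimal merge of one neuron pair, sketching Eqs. (10)-(12).

    wa, wb: incoming weight vectors from the shared part of layer l-1.
    Ha, Hb: layer-wise Hessians for the two neurons (assumed invertible).
    Returns (w_merged, delta_E): the shared weight vector and the
    error increase delta_E_l^opt used as the difference metric d[.].
    """
    Ha_inv, Hb_inv = np.linalg.inv(Ha), np.linalg.inv(Hb)
    M = np.linalg.inv(Ha_inv + Hb_inv)
    dwa = Ha_inv @ M @ (wb - wa)               # Eq. (10)
    dwb = Hb_inv @ M @ (wa - wb)               # Eq. (11)
    delta_E = 0.5 * (wa - wb) @ M @ (wa - wb)  # Eq. (12)
    w_merged = wa + dwa
    # By the constraint in Eq. (9) both updated vectors coincide.
    assert np.allclose(w_merged, wb + dwb)
    return w_merged, delta_E
```

Note the intuition this makes visible: with equal Hessians the merged vector is the plain average of the two, while a neuron with a larger (more sensitive) Hessian pulls the merged weights toward itself.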
In case the input layer dimensions are not equal in both tasks, the dimension of the joint input layer equals the larger of the two original input layer dimensions, and fictive connections (i.e., weight 0) are added to the model whose original input layer is smaller. Afterwards we begin layer-wise neuron sharing and weight matrix updating from the first hidden layer. The two networks are "zipped" layer by layer till the last hidden layer, and we obtain a combined network. After merging each layer, the networks are retrained to re-boost the accuracy.

Practical Issues. We make the following notes on the practicability of MTZ.

• How to set the number of neurons to be shared? One can directly set $\tilde{N}_l$ neurons to be shared for the l-th layers, or set a layer-wise threshold $\varepsilon_l$ instead. Given a threshold $\varepsilon_l$, MTZ shares the pairs of neurons $\{(i_k, j_k) \,|\, d[\tilde{w}^A_{l,i_k}, \tilde{w}^B_{l,j_k}] < \varepsilon_l\}$. In this case $\tilde{N}_l = |\{(i_k, j_k)\}|$. One can set $\{\tilde{N}_l\}$ if there is a hard constraint on storage or memory. Otherwise $\{\varepsilon_l\}$ can be set if accuracy is of higher priority. Note that $\{\varepsilon_l\}$ controls the layer-wise error $\delta E_l$, which correlates to the accumulated errors of the outputs in layer L, $\tilde{\varepsilon}^A = \frac{1}{\sqrt{n^A}} \sum \|\tilde{x}^A_L - x^A_L\|$ and $\tilde{\varepsilon}^B = \frac{1}{\sqrt{n^B}} \sum \|\tilde{x}^B_L - x^B_L\|$ [4].

• How to execute the combined model for each task? During inference, only task-related connections in the combined model are enabled. For instance, when performing inference on task A, we only activate $\{\hat{W}^A_l\}$, $\{\tilde{W}^A_l\}$ and $\{\tilde{W}_l\}$, while $\{\tilde{W}^B_l\}$ and $\{\hat{W}^B_l\}$ are disabled (e.g., by setting them to zero).

• How to zip more than two neural networks? MTZ is able to zip more than two models by sequentially adding each network into the joint network, and the calculated Hessian matrices of the already zipped joint network can be reused. 
Therefore, MTZ is scalable in regards to both the depth of each network and the number of tasks to be zipped. Also note that since calculating the Hessian matrix of one layer requires only its layer input, only one forward pass in total through each model is needed for the merging process (excluding retraining).

3.5 Extension to Convolutional Layers

The layer zipping procedure for two convolutional layers is very similar to that for two fully connected layers. The only difference is that sharing is performed on kernels rather than neurons. Take the i-th kernel of size $k_l \times k_l$ in layer l of $M^A$ as an example. Its incoming weights from the previous shared kernels are $\tilde{W}^{A,in}_{l,i} \in \mathbb{R}^{k_l \times k_l \times \tilde{N}_{l-1}}$. The weights are then flattened into a vector $\tilde{w}^A_{l,i}$ to calculate functional differences. As in Sec. 3.2, after layer zipping in the l-th layers, the weight matrices in the $(l+1)$-th layers need careful permutations regarding the flattening ordering to maintain correct connections among neurons, especially when the next layers are fully connected layers.

Algorithm 1: Multi-task Zipping via Layer-wise Neuron Sharing
input: $\{W^A_l\}, \{W^B_l\}$: weight matrices of $M^A$ and $M^B$;
  $X^A, X^B$: training data of task A and B (including labels);
  $\alpha$: coefficient to adjust the zipping balance of $M^A$ and $M^B$;
  $\{\tilde{N}_l\}$: number of neurons to be shared in layer l
1 for l = 1, . . . , L − 1 do
2   Calculate inputs for the current layer, $x^A_{l-1}$ and $x^B_{l-1}$, using training data from $X^A$ and $X^B$ and forward propagation
3   $\tilde{H}^A_{l,i} \leftarrow \frac{\alpha}{n^A} \sum x^A_{l-1} \cdot (x^A_{l-1})^\top$
4   $\tilde{H}^B_{l,j} \leftarrow \frac{1-\alpha}{n^B} \sum x^B_{l-1} \cdot (x^B_{l-1})^\top$
5   Select $\tilde{N}_l$ pairs of neurons $\{(i_k, j_k)\}$ with the smallest $d[\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j}]$
6   for k ← 1, . . . 
, $\tilde{N}_l$ do
7     $\tilde{w}_{l,k} \leftarrow f(\tilde{w}^A_{l,i_k}, \tilde{w}^B_{l,j_k})$
8   Re-organize $W^A_l$ and $W^B_l$ into $\tilde{W}_l$, $\hat{W}^A_l$, $\hat{W}^B_l$, $\tilde{W}^A_l$ and $\tilde{W}^B_l$
9   Permute the order of rows in $W^A_{l+1}$ and $W^B_{l+1}$ to maintain correct connections
10  Conduct a light retraining on tasks A and B to re-boost the accuracy of the joint model
output: $\{\hat{W}^A_l\}, \{\tilde{W}^A_l\}, \{\tilde{W}_l\}, \{\tilde{W}^B_l\}, \{\hat{W}^B_l\}$: weights of the zipped multi-task model $M^C$

3.6 Extension to Sparse Layers

Since the pre-trained neural networks may have already been sparsified via weight pruning, we also extend MTZ to support sparse models. Specifically, we use sparse matrices, where zeros indicate no connections, to represent such sparse models. The incoming weights from the previous shared neurons/kernels, $\tilde{w}^A_{l,i}$ and $\tilde{w}^B_{l,j}$, still have the same dimension. Therefore $d[\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j}]$ and $f(\tilde{w}^A_{l,i}, \tilde{w}^B_{l,j})$ can be calculated as usual. However, we also calculate two mask vectors $\tilde{m}^A_{l,i}$ and $\tilde{m}^B_{l,j}$, whose elements are 0 when the corresponding elements in $\tilde{w}^A_{l,i}$ and $\tilde{w}^B_{l,j}$ are 0, and 1 otherwise. We pick the mask vector with more 1's and apply it to $\tilde{w}_l$. This way the combined model always has a smaller number of connections (weights) than the sum of the original two models.

3.7 Extension to Residual Networks

MTZ can also be extended to merge residual networks [9]. To simplify the merging process, we assume that the last layer is always fully merged when merging the next layers. Hence after merging we have only the matrices $\hat{W}^A_l \in \mathbb{R}^{N^A_{l-1} \times \hat{N}^A_l}$, $\hat{W}^B_l \in \mathbb{R}^{N^B_{l-1} \times \hat{N}^B_l}$, and $\tilde{W}_l \in \mathbb{R}^{\min\{N^A_{l-1}, N^B_{l-1}\} \times \tilde{N}_l}$. This assumption is able to provide decent performance (see Sec. 4.3). 
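Returning briefly to the sparse case of Sec. 3.6, the mask selection can be sketched as follows (a minimal illustration with our own names; `merge_fn` stands in for the usual $f(\cdot)$ computation):

```python
import numpy as np

def merge_sparse(wa, wb, merge_fn):
    """Sketch of the sparse-layer merge of Sec. 3.6."""
    ma = (wa != 0).astype(float)   # mask of A's existing connections
    mb = (wb != 0).astype(float)   # mask of B's existing connections
    w = merge_fn(wa, wb)           # d[.] and f(.) computed as usual
    # Keep the mask with more ones, so the merged vector never has
    # more connections than the denser of the two originals.
    mask = ma if ma.sum() >= mb.sum() else mb
    return w * mask
```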
Note that the sequence of the channels of the shortcuts needs to be permuted before and after the adding operation at the end of each residual block in order to maintain correct connections after zipping.

4 Experiments

We evaluate the performance of MTZ on zipping networks pre-trained for the same task (Sec. 4.1) and for different tasks (Sec. 4.2 and Sec. 4.3). We mainly assess the test errors of each task after network zipping and the retraining overhead involved. MTZ is implemented with TensorFlow. All experiments are conducted on a workstation equipped with an Nvidia Titan X (Maxwell) GPU.

4.1 Performance to Zip Two Networks (LeNet) Pre-trained for the Same Task

This experiment validates the effectiveness of MTZ by merging two differently trained models for the same task. Ideally, two models trained to different local optima should function the same on the test data. Therefore their hidden layers can be fully merged without incurring any accuracy loss. This experiment aims to show that, by finding the correct pairs of neurons that share the same functionality, MTZ can achieve the theoretical limit of the compression ratio, i.e., 100%, even without any retraining involved.

Figure 2: Test error on MNIST by continually sharing neurons in (a) the first and (b) the second fully connected layers of two dense LeNet-300-100 networks till the merged layers are fully shared.

Table 1: Test errors on MNIST by sharing all neurons in two LeNet networks.

Model | errA | errB | re-errC | # re-iter
LeNet-300-100-Dense | 1.57% | 1.60% | 1.64% | 550
LeNet-300-100-Sparse | 1.80% | 1.81% | 1.83% | 800
LeNet-5-Dense | 0.89% | 0.95% | 0.93% | 600
LeNet-5-Sparse | 1.27% | 1.28% | 1.29% | 1200

Dataset and Settings. 
We experiment on the MNIST dataset with the LeNet-300-100 and LeNet-5 networks [14] to recognize handwritten digits from zero to nine. LeNet-300-100 is a fully connected network with two hidden layers (300 and 100 neurons each), reporting an error from 1.6% to 1.76% on MNIST [4][14]. LeNet-5 is a convolutional network with two convolutional layers and two fully connected layers, which achieves an error ranging from 0.8% to 1.27% on MNIST [4][14].

We train two LeNet-300-100 networks of our own with errors of 1.57% and 1.60%, and two LeNet-5 networks with errors of 0.89% and 0.95%. All the networks are initialized randomly with different seeds, and the training data are also shuffled before every training epoch. After training, the ordering of neurons/kernels in all hidden layers is once more randomly permuted. Therefore the models have completely different parameters (weights). The training of the LeNet-300-100 and LeNet-5 networks requires $1.05 \times 10^4$ and $1.1 \times 10^4$ iterations on average, respectively.

For sparse networks, we apply one iteration of L-OBS [4] to prune the weights of the four LeNet networks. We then enforce all neurons to be shared in each hidden layer of the two dense LeNet-300-100 networks, sparse LeNet-300-100 networks, dense LeNet-5 networks, and sparse LeNet-5 networks, using MTZ.

Results. Fig. 2a plots the average error after sharing different amounts of neurons in the first layers of two dense LeNet-300-100 networks. Fig. 2b shows the error when further merging the second layers. We compare MTZ with a random sharing scheme, which shares neurons by first picking $(i_k, j_k)$ at random, and then choosing randomly between $\tilde{w}^A_{l,i_k}$ and $\tilde{w}^B_{l,j_k}$ as the shared weights $\tilde{w}_{l,k}$. When all the 300 neurons in the first hidden layers are shared, there is an increase of 0.95% in test error (averaged over the two models) even without retraining, while random sharing induces an error of 33.47%. 
We also use MTZ to fully merge the hidden layers in the two LeNet-300-100 networks without any retraining, i.e., without line 10 in Algorithm 1. The averaged test error increases by only 1.50%.

Table 1 summarizes the errors of each LeNet pair before zipping (errA and errB), after being fully merged with retraining (re-errC), and the number of retraining iterations involved (# re-iter). MTZ consistently achieves lossless network zipping on fully connected and convolutional networks, whether they are dense or sparse, with 100% of the parameters of the hidden layers shared. Meanwhile, the number of retraining iterations is approximately 19.0× and 18.7× fewer than that of training a dense LeNet-300-100 network and a dense LeNet-5 network, respectively.

Table 2: Test errors and retraining iterations of sharing all neurons (output layer fc8 excluded) in two well-trained VGG-16 networks for ImageNet and CelebA.

Layer | $N^A_l$ | ImageNet (Top-5 Error) w/o-re-errC | ImageNet re-errC | CelebA (Error) w/o-re-errC | CelebA re-errC | # re-iter
conv1_1 | 64 | 10.59% | 10.61% | 8.45% | 8.43% | 50
conv1_2 | 64 | 11.19% | 10.78% | 8.82% | 8.77% | 100
conv2_1 | 128 | 10.99% | 10.68% | 8.91% | 8.82% | 100
conv2_2 | 128 | 11.31% | 11.03% | 9.23% | 9.07% | 100
conv3_1 | 256 | 11.65% | 11.46% | 9.16% | 9.04% | 100
conv3_2 | 256 | 11.92% | 11.83% | 9.17% | 9.05% | 100
conv3_3 | 256 | 12.54% | 12.41% | 9.46% | 9.34% | 100
conv4_1 | 512 | 13.40% | 12.28% | 10.18% | 9.69% | 400
conv4_2 | 512 | 13.02% | 12.62% | 10.65% | 10.25% | 400
conv4_3 | 512 | 13.11% | 12.97% | 12.03% | 10.92% | 400
conv5_1 | 512 | 13.46% | 13.09% | 12.62% | 11.68% | 400
conv5_2 | 512 | 13.77% | 13.20% | 12.61% | 11.64% | 400
conv5_3 | 512 | 36.07% | 13.35% | 13.10% | 12.01% | 1×10^3
fc6 | 4096 | 15.08% | 15.17% | 12.31% | 11.71% | 2×10^3
fc7 | 4096 | 15.73% | 14.07% | 11.98% | 11.09% | 1×10^4

Table 3: Test errors, number of shared neurons, and retraining iterations of adaptively zipping two well-trained VGG-16 networks for ImageNet and 
CelebA.

Layer | $N^A_l$ | $\tilde{N}_l$ | ImageNet (Top-5 Error) w/o-re-errC | ImageNet re-errC | CelebA (Error) w/o-re-errC | CelebA re-errC | # re-iter
conv1_1 | 64 | 64 | 10.28% | 10.37% | 8.39% | 8.33% | 50
conv1_2 | 64 | 64 | 10.93% | 10.50% | 8.77% | 8.54% | 100
conv2_1 | 128 | 96 | 10.74% | 10.57% | 8.62% | 8.46% | 100
conv2_2 | 128 | 96 | 10.87% | 10.79% | 8.56% | 8.47% | 100
conv3_1 | 256 | 192 | 10.83% | 10.76% | 8.62% | 8.48% | 100
conv3_2 | 256 | 192 | 10.92% | 10.71% | 8.52% | 8.44% | 100
conv3_3 | 256 | 192 | 10.86% | 10.71% | 8.83% | 8.63% | 100
conv4_1 | 512 | 384 | 10.69% | 10.51% | 9.39% | 8.71% | 400
conv4_2 | 512 | 320 | 10.43% | 10.46% | 9.06% | 8.80% | 400
conv4_3 | 512 | 320 | 10.56% | 10.36% | 9.36% | 8.93% | 400
conv5_1 | 512 | 436 | 10.42% | 10.51% | 9.54% | 9.15% | 400
conv5_2 | 512 | 436 | 10.47% | 10.49% | 9.43% | 9.16% | 400
conv5_3 | 512 | 436 | 10.49% | 10.24% | 9.61% | 9.07% | 1×10^3
fc6 | 4096 | 1792 | 11.46% | 11.33% | 9.37% | 9.18% | 2×10^3
fc7 | 4096 | 4096 | 11.45% | 10.75% | 9.15% | 8.95% | 1.5×10^4

4.2 Performance to Zip Two Networks (VGG-16) Pre-trained for Different Tasks

This experiment evaluates the performance of MTZ in automatically sharing information between two neural networks for different tasks. We investigate: (i) what the accuracy loss is when all hidden layers of two models for different tasks are fully shared (for maximal size reduction); and (ii) how many neurons and parameters can be shared between the two models by MTZ with at most a 0.5% increase in test errors allowed (for minimal accuracy loss).

Dataset and Settings. We explore merging two VGG-16 networks trained on the ImageNet ILSVRC-2012 dataset [24] for object classification and the CelebA dataset [16] for facial attribute classification. The ImageNet dataset contains images of 1,000 object categories. The CelebA dataset consists of 200 thousand celebrity face images labelled with 40 attribute classes. VGG-16 is a deep convolutional network with 13 convolutional layers and 3 fully connected layers. 
We directly adopt the pre-trained weights from the original VGG-16 model [25] for the object classification task, which yields a 10.31% top-5 error in our evaluation. For the facial attribute classification task, we train a second VGG-16 model following a process similar to that in [17]. We initialize the convolutional layers of a VGG-16 model using the pre-trained parameters from imdb-wiki [23], then train the remaining 3 fully connected layers until the model yields an error of 8.50%, which matches the accuracy of the VGG-16 model used in [17] on CelebA. We conduct two experiments with the two VGG-16 models: (i) all hidden layers in the two models are 100% merged using MTZ; (ii) each pair of layers in the two models is adaptively merged using MTZ, allowing an increase (< 0.5%) in test errors on the two datasets.

Table 4: Test errors of pre-trained single ResNets and the joint network merged by MTZ. 1× is the number of parameters of one single ResNet excluding the last classification layer.

                   #par.   C100     GTSR    OGlt     SVHN    UCF      mean
5× Single model    5×      29.19%   1.48%   14.40%   6.86%   37.83%   17.95%
Joint model        1.5×    29.13%   0.09%   15.65%   7.08%   39.04%   18.20%

Results. Table 2 summarizes the performance when each pair of hidden layers is 100% merged. The test errors of both tasks gradually increase during the zipping procedure from layer conv1_1 to conv5_2, and then the error on ImageNet surges when conv5_3 is 100% shared. After 1,000 iterations of retraining, the accuracy of both tasks is restored. When 100% of the parameters of all hidden layers are shared between the two models, the joint model yields test errors of 14.07% on ImageNet and 11.09% on CelebA, i.e., increases of 3.76% and 2.59% over the original test errors.
Table 3 shows the performance when each pair of hidden layers is adaptively merged. Ultimately, MTZ achieves an increase in test errors of 0.44% on ImageNet and 0.45% on CelebA.
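These headline figures follow directly from the per-layer numbers in Tables 2 and 3 and the baselines in Section 4.2. As a quick sanity check (the values below are transcribed from the tables; the script is illustrative and not part of the authors' implementation):

```python
# Arithmetic check of the reported VGG-16 zipping results.
# Baselines from Section 4.2: 10.31% top-5 error on ImageNet, 8.50% on CelebA.
base_imagenet, base_celeba = 10.31, 8.50

# Full merge (Table 2, row fc7, re-errC columns).
assert round(14.07 - base_imagenet, 2) == 3.76   # ImageNet error increase
assert round(11.09 - base_celeba, 2) == 2.59     # CelebA error increase

# Adaptive merge (Table 3, row fc7, re-errC columns).
assert round(10.75 - base_imagenet, 2) == 0.44
assert round(8.95 - base_celeba, 2) == 0.45

# Retraining cost: the '# re-iter' column of Table 3 sums to 20,650 iterations,
# at least 17.9x fewer than the ~3.7e5 iterations of from-scratch training [25].
re_iters = [50] + [100] * 6 + [400] * 5 + [1_000, 2_000, 15_000]
total = sum(re_iters)
assert total == 20_650
print(3.7e5 / total)   # ~17.92, i.e. at least 17.9x fewer iterations
```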
Approximately 39.61% of the parameters in the two models are shared (56.94% in the 13 convolutional layers and 38.17% in the 2 fully connected layers). The zipping procedure involves 20,650 iterations of retraining. For comparison, at least 3.7×10^5 iterations are needed to train a VGG-16 network from scratch [25]. That is, MTZ is able to inherit information from the pre-trained models and construct a combined model with an increase in test errors of less than 0.5%, and the process requires at least 17.9× fewer (re)training iterations than training a joint network from scratch.
For comparison, we also trained a fully shared multi-task VGG-16 with two split classification layers jointly on both tasks. Its test errors are 14.88% on ImageNet and 13.29% on CelebA. This model has exactly the same topology and number of parameters as the model constructed by MTZ, but performs slightly worse on both tasks.

4.3 Performance to Zip Multiple Networks (ResNets) Pre-trained for Different Tasks

This experiment shows the performance of MTZ when merging more than two neural networks for different tasks, where the model for each task is pre-trained with a deeper architecture such as a ResNet.
Dataset and Settings. We adopt experiment settings similar to [21], a recent work on multi-task learning with ResNets. Specifically, five ResNet28 networks [30] are trained for diverse recognition tasks: CIFAR100 (C100) [12], the German Traffic Sign Recognition (GTSR) Benchmark [27], Omniglot (OGlt) [13], Street View House Numbers (SVHN) [20], and UCF101 (UCF) [26]. We set the same 90% compression ratio for all five models and evaluate the performance of MTZ by the accuracy of the joint model on each task.
Results. Table 4 shows the accuracy of each individual pre-trained model and of the joint model on the five tasks. The average accuracy decrease is a negligible 0.25%.
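The reported means are consistent with the per-task errors in Table 4; a short check (values transcribed from the table, script ours):

```python
# Verify the Table 4 aggregates: per-task test errors of the five single
# ResNet28 models vs. the joint model produced by MTZ.
tasks = ["C100", "GTSR", "OGlt", "SVHN", "UCF"]
single = [29.19, 1.48, 14.40, 6.86, 37.83]   # reported mean: 17.95%
joint = [29.13, 0.09, 15.65, 7.08, 39.04]    # reported mean: 18.20%

mean_single = sum(single) / len(single)
mean_joint = sum(joint) / len(joint)
print(round(mean_single, 2))               # 17.95
print(round(mean_joint, 2))                # 18.2 (reported as 18.20%)
print(round(mean_joint - mean_single, 2))  # 0.25 -> the "negligible 0.25%" decrease
```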
Although ResNets are much deeper and have a more complex topology than VGG-16, MTZ is still able to effectively reduce the overall number of parameters while retaining the accuracy on each task.

5 Conclusion
We propose MTZ, a framework to automatically merge multiple correlated, well-trained deep neural networks for cross-model compression via neuron sharing. It selectively shares neurons and optimally updates their incoming weights on a layer basis to minimize the error induced on each individual task. Only light retraining is necessary to restore the accuracy of the joint model on each task. Evaluations show that MTZ can fully merge two VGG-16 networks with error increases of 3.76% and 2.59% on ImageNet for object classification and on CelebA for facial attribute classification, respectively, or share 39.61% of the parameters between the two models with < 0.5% error increase. The number of iterations to retrain the combined model is 17.9× lower than that of training a single VGG-16 network. Meanwhile, MTZ is able to share 90% of the parameters among five ResNets on five different visual recognition tasks while inducing negligible accuracy loss. Preliminary experiments also show that MTZ is applicable to sparse networks. We plan to further investigate the integration of MTZ with weight pruning in the future.

References
[1] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[2] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Proceedings of Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[3] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Proceedings of Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
[4] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Proceedings of Advances in Neural Information Processing Systems, pages 4860–4874, 2017.
[5] Petko Georgiev, Sourav Bhattacharya, Nicholas D. Lane, and Cecilia Mascolo. Low-resource multi-task audio sensing for mobile and embedded devices via shared deep neural network representations. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):50:1–50:19, 2017.
[6] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Proceedings of Advances in Neural Information Processing Systems, pages 1379–1387, 2016.
[7] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of International Conference on Learning Representations, 2016.
[8] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Proceedings of Advances in Neural Information Processing Systems, pages 164–171, 1993.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[11] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[12] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[13] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction.
Science, 350(6266):1332–1338, 2015.
[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Proceedings of Advances in Neural Information Processing Systems, pages 598–605, 1990.
[16] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
[17] Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogerio Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 5334–5343, 2017.
[18] Akhil Mathur, Nicholas D. Lane, Sourav Bhattacharya, Aidan Boran, Claudio Forlivesi, and Fahim Kawsar. DeepEye: Resource efficient local execution of multiple deep vision models using wearable commodity hardware. In Proceedings of ACM Annual International Conference on Mobile Systems, Applications, and Services, pages 68–81, 2017.
[19] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
[20] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[21] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Proceedings of Advances in Neural Information Processing Systems, pages 506–516, 2017.
[22] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[23] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[27] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
[28] Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. In Proceedings of International Conference on Learning Representations, 2016.
[29] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Proceedings of Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[30] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[31] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.