{"title": "Modular Universal Reparameterization: Deep Multi-task Learning Across Diverse Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 7903, "page_last": 7914, "abstract": "As deep learning applications continue to become more diverse, an interesting question arises: Can general problem solving arise from jointly learning several such diverse tasks? To approach this question, deep multi-task learning is extended in this paper to the setting where there is no obvious overlap between task architectures. The idea is that any set of (architecture,task) pairs can be decomposed into a set of potentially related subproblems, whose sharing is optimized by an efficient stochastic algorithm. The approach is first validated in a classic synthetic multi-task learning benchmark, and then applied to sharing across disparate architectures for vision, NLP, and genomics tasks. It discovers regularities across these domains, encodes them into sharable modules, and combines these modules systematically to improve performance in the individual tasks. The results confirm that sharing learned functionality across diverse domains and architectures is indeed beneficial, thus establishing a key ingredient for general problem solving in the future.", "full_text": "Modular Universal Reparameterization:\n\nDeep Multi-task Learning Across Diverse Domains\n\nElliot Meyerson\n\nCognizant\n\nelliot.meyerson@cognizant.com\n\nRisto Miikkulainen1,2\n\nCognizant1\n\nThe University of Texas at Austin2\n\nristo@cs.utexas.edu\n\nAbstract\n\nAs deep learning applications continue to become more diverse, an interesting\nquestion arises: Can general problem solving arise from jointly learning several\nsuch diverse tasks? To approach this question, deep multi-task learning is extended\nin this paper to the setting where there is no obvious overlap between task architec-\ntures. 
The idea is that any set of (architecture,task) pairs can be decomposed into a\nset of potentially related subproblems, whose sharing is optimized by an ef\ufb01cient\nstochastic algorithm. The approach is \ufb01rst validated in a classic synthetic multi-task\nlearning benchmark, and then applied to sharing across disparate architectures for\nvision, NLP, and genomics tasks. It discovers regularities across these domains,\nencodes them into sharable modules, and combines these modules systematically\nto improve performance in the individual tasks. The results con\ufb01rm that sharing\nlearned functionality across diverse domains and architectures is indeed bene\ufb01cial,\nthus establishing a key ingredient for general problem solving in the future.\n\n1\n\nIntroduction\n\nDeep learning methods and applications continue to become more diverse. They now solve problems\nthat deal with fundamentally different kinds of data, including those of human behavior, such as\nvision, language, and speech, as well as those of natural phenomena, such as biological, geological,\nand astronomical processes.\nAcross these domains, deep learning architectures are painstakingly customized to different problems.\nHowever, despite this extreme customization, a crucial amount of functionality is shared across\nsolutions. For one, architectures are all made of the same ingredients: some creative composition\nand concatenation of high-dimensional linear maps and elementwise nonlinearities. They also share\na common set of training techniques, including popular initialization schemes and gradient-based\noptimization methods. The fact that the same small toolset is successfully applied to all these\nproblems implies that the problems have a lot in common. Sharing these tools across problems\nexploits some of these commonalities, i.e., by setting a strong prior on the kinds of methods that will\nwork. 
Such sharing is methodological, with humans determining what is shared.\nThis observation begs the question: Are there commonalities across these domains that methodolog-\nical sharing cannot capture? Note that this question is different from that addressed by previous\nwork in deep multi-task learning (DMTL), where the idea is to share knowledge across tasks in the\nsame domain or modality, such as within vision [5, 30, 33, 39, 57, 61] or language [9, 13, 16, 31, 34].\nIn contrast, this question is fundamental to general problem solving: Can it be bene\ufb01cial to share\nlearned functionality across a diverse set of tasks, such as a 2D convolutional vision network, an\nLSTM model for natural language, and a 1D convolutional model for genomics? Speci\ufb01cally, this\npaper considers the following problem: Given an arbitrary set of (architecture,task) pairs, can learned\nfunctionality be shared across architectures to improve performance in each task?\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fDrawing on existing approaches to DMTL, a \ufb01rst approach to this problem is developed, showing\nthat such effective sharing is indeed possible. The approach is based on decomposing the general\nmulti-task learning problem into several \ufb01ne-grained and equally-sized subproblems, or pseudo-tasks.\nTraining a set of (architecture,task) pairs then corresponds to solving a set of related pseudo-tasks,\nwhose relationships can be exploited by shared functional modules. To make this framework practical,\nan ef\ufb01cient search algorithm is introduced for optimizing the mapping between pseudo-tasks and\nthe modules that solve them, while simultaneously training the modules themselves. 
The approach,\nmodular universal reparameterization (MUiR), is validated in a synthetic MTL benchmark problem,\nand then applied to large-scale sharing between the disparate modalities of vision, NLP, and genomics.\nIt leads to improved performance on each task, and highly-structured architecture-dependent sharing\ndynamics, in which the modules that are shared more demonstrate increased properties of generality.\nThese results show that MUiR makes it possible to share knowledge across diverse domains, thus\nestablishing a key ingredient for building general problem solving systems in the future.\n\n2 Problem Statement and Related Work\n\nThis paper is concerned with the following question: Given an arbitrary set of (architecture,task)\npairs, can learned functionality be shared across architectures to improve performance in each task?\nAny method that answers this question must satisfy two requirements: (1) It must support any given\nset of architectures, and (2) it must align parameters across the given architectures.\nParameters in two architectures are aligned if they have some learnable tensor in common. An\nalignment across architectures implies how tasks are related, and how much they are related. The\ngoal of DMTL is to improve performance across tasks through joint training of aligned architectures,\nexploiting inter-task regularities. In recent years, DMTL has been applied within areas such as vision\n[5, 30, 33, 39, 57, 61], natural language [9, 13, 16, 31, 34], speech [19, 46, 55], and reinforcement\nlearning [11, 20, 51]. The rest of this section reviews existing DMTL methods, showing that none of\nthese methods satisfy both conditions (1) and (2).\nThe classical approach to DMTL considers a joint model across tasks in which some aligned layers\nare shared completely across tasks, and the remaining layers remain task-speci\ufb01c [7]. 
In practice, the\nmost common approach is to share all layers except for the final classification layers [11, 13, 18, 19,\n20, 31, 42, 55, 61]. A more flexible approach is to not share parameters exactly across shared layers,\nbut to factorize layer parameters into shared and task-specific factors [3, 23, 28, 32, 44, 56, 57]. Such\napproaches work for any set of architectures that have a known set of aligned layers. However, these\nmethods only apply when such alignment is known a priori. That is, they do not meet condition (2).\nOne approach to overcome the alignment problem is to design an entirely new architecture that\nintegrates information from different tasks and is maximally shared across tasks [5, 16, 22]. Such\nan approach can even be used to share knowledge across disparate modalities [22]. However, by\ndisregarding task-specific architectures, this approach does not meet condition (1). Related approaches\nattempt to learn how to assemble a set of shared modules in different ways to solve different tasks,\nwhether by gradient descent [37], reinforcement learning [45], or evolutionary architecture search\n[30]. These methods also construct new architectures, so they do not meet condition (1); however,\nthey have shown that including a small number of location-specific parameters is crucial to sharing\nfunctionality across diverse locations.\nDrawing on the methods above, this paper introduces a first approach that meets both conditions.\nFirst, a simple decomposition is introduced that applies to any set of architectures and supports\nautomatic alignment. This decomposition is extended to include a small number of location-specific\nparameters, which are integrated in a manner mirroring factorization approaches. Then, an efficient\nalignment method is developed that draws on automatic assembly methods. 
These methods combine\nto make it possible to share effectively across diverse architectures and modalities.\n\n3 Modular Universal Reparameterization\n\nThis section presents a framework for decomposing sets of (architecture,task) pairs into equally-sized\nsubproblems (i.e., pseudo-tasks), sharing functionality across aligned subproblems via a simple\nfactorization, and optimizing this alignment with an efficient stochastic algorithm.\n\n3.1 Decomposition into linear pseudo-tasks\n\nConsider a set of T tasks {{x_ti, y_ti}_{i=1}^{N_t}}_{t=1}^{T}, with corresponding model architectures {M_t}_{t=1}^{T},\neach parameterized by a set of trainable tensors θ_{M_t}. In MTL, these sets have non-trivial pairwise\nintersections, and are trained in a joint model to find optimal parameters θ*_{M_t} for each task:\n\n∪_{t=1}^{T} θ*_{M_t} = argmin_{∪_{t=1}^{T} θ_{M_t}} (1/T) Σ_{t=1}^{T} (1/N_t) Σ_{i=1}^{N_t} ℒ_t(y_ti, ŷ_ti),    (1)\n\nwhere ŷ_ti = M_t(x_ti; θ_{M_t}) is a prediction and ℒ_t is a sample-wise loss function for the tth task.\nGiven fixed task architectures, the key question in designing an MTL model is how the θ_{M_t} should\nbe aligned. The following decomposition provides a generic way to frame this question.\nSuppose each tensor in each θ_{M_t} can be decomposed into equally-sized parameter blocks B_ℓ of size\nm × n, and there are L such blocks total across all θ_{M_t}. Then, the parameterization for the entire\njoint model can be rewritten as:\n\n∪_{t=1}^{T} θ_{M_t} = (B_1, . . . , B_L).    (2)\n\nThat is, the entire joint parameter set can be regarded as a single tensor B ∈ ℝ^{L×m×n}. The vast\nmajority of parameter tensors in practice can be decomposed in this way such that each B_ℓ defines\na linear map. 
For one, the pm × qn weight matrix of a dense layer with pm inputs and qn outputs\ncan be broken into pq blocks of size m × n, where the (i, j)th block defines a map between units im\nto (i + 1)m − 1 of the input space and units jn to (j + 1)n − 1 of the output space. This approach\ncan be extended to convolutional layers by separately decomposing each matrix corresponding to a\nsingle location in the receptive field. Similarly, the parameters of an LSTM layer are contained in\nfour matrices, each of which can be separately decomposed. When m and n are relatively small, the\nrequirement that m and n divide their respective dimensions is a minor constraint; layer sizes can be\nadjusted without noticeable effect, or overflowing parameters from edge blocks can be discarded.\nNow, if each B_ℓ defines a linear map, then training B corresponds to solving L linear pseudo-tasks [38]\nthat define subproblems within the joint model. Suppose B_ℓ defines a linear map in M_t. Then, the\nℓth pseudo-task is solved by completing the computational graph of M_t with the subgraph\ncorresponding to B_ℓ removed. The ℓth pseudo-task is denoted by a five-tuple\n\n(E_ℓ, θ_{E_ℓ}, D_ℓ, θ_{D_ℓ}, {x_ti, y_ti}_{i=1}^{N_t}),    (3)\n\nwhere E_ℓ is the encoder that maps each x_ti to the input of a function solving the pseudo-task, and\nD_ℓ takes the output of that function (and possibly x_ti) to the prediction ŷ_ti. The parameters θ_{E_ℓ}\nand θ_{D_ℓ} characterize E_ℓ and D_ℓ, respectively.\n\nFigure 1: Pseudo-task decomposition. Architecture M, for task {x_i, y_i}_{i=1}^{N}, induces a pseudo-task\nsolved by a function f. E is an encoder that provides input to f, and D is a decoder that uses\nthe output of f to produce the final prediction. If f is effective for many [task, encoder, decoder]\ncombinations, then it shows generic functionality.\n\nIn general, given a pseudo-task, the model for the tth task is completed by a differentiable function f\nthat connects the pseudo-task’s inputs to its outputs. The goal for solving this pseudo-task is to find a\nfunction that minimizes the loss of the underlying task. The completed model is given by\n\nŷ_t = D_ℓ(f(E_ℓ(x_t; θ_{E_ℓ}); θ_f), x_t; θ_{D_ℓ}).    (4)\n\nThis formulation is depicted in Figure 1. Since all L pseudo-tasks induced by Eq. 2 have the same\ninput-output specification, if f solves one of them, it can be applied to any of them in a modular way.\nSince all pseudo-tasks are derived from the same universe of tasks and architectures, sharing modules\nacross them can be valuable. Indeed, sharing across related parameter blocks is a common tool to\nimprove generalization in deep learning. For example, a convolutional layer can be viewed as a dense\nlayer with parameter blocks shared across space, and a recurrent layer as a sequential network of\ndense layers with parameter blocks shared across depths, i.e., time. Similarly, the standard DMTL\napproach is to design a joint architecture with some parameter blocks shared across related tasks.\nThis paper extends DMTL to sharing factors across related pseudo-tasks.\n\n3.2 Reparameterization by hypermodules\n\nAssuming an effective alignment of related pseudo-tasks exists, how should parameters be shared\nacross them? Reusing modules at qualitatively different locations in a network has been successful\nwhen a small number of location-specific parameters are included to increase flexibility [30, 37], and\nhas been detrimental when such parameters are not included [45]. To include such parameters in a\nsimple and flexible way, and avoid additional assumptions about the kind of sharing that can occur,\neach B_ℓ can be generated by a hypermodule, the module-specific analog of a hypernetwork [15, 48].\nAssociate with the ℓth pseudo-task a context vector z_ℓ ∈ ℝ^c. 
Suppose there is also a collection of K\nhypermodules {H_k}_{k=1}^{K}, with H_k ∈ ℝ^{c×m×n}, and let ψ : {1, . . . , L} → {H_k}_{k=1}^{K} be an alignment\nfunction that indicates which hypermodule solves the ℓth pseudo-task. Then, the parameters of the\nunderlying architectures are generated by\n\nB_ℓ = ψ(ℓ) ×̄_1 z_ℓ,    (5)\n\nwhere ×̄_1 denotes the 1-mode (vector) product of a tensor and a vector [25]. In other words, the value\nat B_ℓij is the dot product between z_ℓ and the fiber in ψ(ℓ) associated with the (i, j)th element of B_ℓ.\nWith the additional goal of optimizing ψ, the block decomposition (Eq. 2) can now be written as\n\n∪_{t=1}^{T} θ_{M_t} = [(H_1, . . . , H_K), (z_1, . . . , z_L)].    (6)\n\nTo accurately apply Eq. 6 to a set of architectures, the parameter initialization scheme must be\npreserved. Say the parameters of a layer are initialized i.i.d. with variance σ² and mean 0, and each\nB_ℓ is initialized with a distinct hypermodule ψ(ℓ) = H_ℓ. When c > 1, B_ℓij = ⟨H_ℓ:ij, z_ℓ⟩ is a sum\nof random variables, so it is impossible to initialize H_ℓ and z_ℓ i.i.d. such that B_ℓij is initialized\nfrom a uniform distribution. However, it is possible to initialize B_ℓij from a normal distribution, by\ninitializing H_ℓ from a normal distribution N(0, σ_H²) and initializing z_ℓ with constant magnitude |z|:\n\nB_ℓij = ⟨H_ℓ:ij, z_ℓ⟩ ∼ c|z| N(0, σ_H²) = N(0, z²c²σ_H²) = N(0, σ²) ⟹ |z| = σ / (cσ_H).    (7)\n\nIn this paper, σ² and σ_H² are determined by He normal initialization [17], which implies a unique |z|.\nAlthough z_ℓ could be initialized uniformly from {−z, z}^c, it is instead initialized to the constant z,\nto encourage compatibility of hypermodules across contexts. Similarly, the fact that all H_k have the\nsame σ_H² makes it easier for them to capture functionality that applies across pseudo-tasks.\nAlthough it is pessimistic to initialize each pseudo-task with its own hypermodule, parsimonious\nmodels can be achieved through optimization of ψ. 
Using the same hypermodule for many pseudo-tasks\nhas the side-benefit of reducing the size of the joint model. The original model in Eq. 2 has\nLmn trainable parameters, while Eq. 6 has Lc + Kcmn, which is more parsimonious only when\nK < L(mn − c)/cmn < L/c, i.e., when each hypermodule is used for more than c pseudo-tasks on\naverage. However, after training, any hypermodule used fewer than c times can be replaced with\nthe parameters it generates, so the model complexity at inference is never greater than that of the\noriginal model: (L − L_o)c + Kcmn + L_o mn ≤ Lmn, where L_o is the number of pseudo-tasks\nparameterized by hypermodules used fewer than c times. An algorithm that improves parsimony in\nthis way while exploiting related pseudo-tasks is introduced next.\n\n3.3 Interleaved optimization of pseudo-task alignment\n\nGiven the above decomposition and reparameterization, the goal is to find an optimal alignment ψ,\ngiven by a fixed-length mapping (ψ(1), . . . , ψ(L)), with K possible choices for each element. Let h\nbe a scoring function that returns the performance of a mapping via training and evaluation of the joint\nmodel. In order to avoid training the model from scratch each iteration, existing DMTL approaches\nthat include nondifferentiable optimization interleave this optimization with gradient-based updates\n[8, 30, 33, 38, 45]. These methods take advantage of the fact that at every iteration there are T scores,\none for each task. These scores can be optimized in parallel, and faster convergence is achieved, by\neffectively decomposing the problem into T subproblems. This section illustrates that such problem\ndecomposition can be greatly expanded, leading to practical optimization of ψ.\n\nDecomposition Level        Expected Convergence Time\nNone (Multi-task)          O(KL log L)\nPer-task (Single-task)     O(KL(log L − log T) log T / T)\nPer-block (Pseudo-task)    O(K log L)\n\nTable 1: Complexity of pseudo-task alignment. This table gives the expected times of Algorithm 1 for\nfinding the optimal mapping of L pseudo-tasks to K hypermodules, in a model with T tasks. The\nruntime of pseudo-task-level optimization scales logarithmically with the size of the model.\n\nAlgorithm 1 Decomposed K-valued (1 + λ)-EA\n1: Create initial solutions ψ^0_1, . . . , ψ^0_D each of length L/D\n2: while any ψ^0_d is suboptimal do\n3:   for d = 1 to D do\n4:     for i = 1 to λ do\n5:       ψ^i_d ← ψ^0_d\n6:       for ℓ = 1 to L/D do\n7:         With probability D/L, ψ^i_d(ℓ) ∼ U({H_k}_{k=1}^{K})\n8:   for d = 1 to D do\n9:     ψ^0_d = argmax_{ψ^i_d} h(ψ^i_d)\n\nIn general, ψ may be decomposed into D submappings {ψ_d}_{d=1}^{D}, each with a distinct\nevaluation function h_d. For simplicity, let each submapping be optimized with an instance of the\n(1+λ)-EA, a Markovian algorithm that is robust to noise, dynamic environments, and local optima\n[12, 40, 49], and is a component of existing DMTL methods [30, 38]. The algorithm generates new\nsolutions by resampling elements of the best solution with an optimal fixed probability. Algorithm 1\nextends the (1+λ)-EA to optimizing submappings in parallel. Assume each ψ_d has length L/D, λ = 1, all h_d are linear,\ni.e., h_d(ψ_d) = Σ_{ℓ=1}^{L} w_dℓ · I(ψ_d(ℓ) = ψ*_d(ℓ)), where w_dℓ are positive scalars, I is the indicator\nfunction, and ψ* is a unique optimal mapping, with ψ*(ℓ) = H_1 ∀ℓ. The runtime of this algorithm\n(number of iterations through the while loop) is summarized by the following result (proof in S.1):\nTheorem 3.1. The expected time of the decomposed K-valued (1+1)-EA is O(KL(log L − log D) log D / D),\nwhen all h_d are linear.\n\nResulting runtimes for key values of D are given in Table 1. As expected, setting D = T gives a\nsubstantial speed-up over D = 1. However, when T is small relative to L, e.g., when sharing across\na small number of complex models, the factor of L in the numerator is a bottleneck. 
Setting D = L\novercomes this issue, and corresponds to having a distinct evaluation function for each pseudo-task.\nThe pessimistic initialization suggested in Section 3.2 avoids initial detrimental sharing, but introduces\nanother bottleneck: large K. This bottleneck can be overcome by sampling hypermodules in Line 7\nproportional to their usage in ψ^0. Such proportional sampling encodes a prior which biases search\ntowards modules that already show generality, and yields the following result (proof in S.2):\nTheorem 3.2. The expected time of the decomposed K-valued (1+1)-EA with pessimistic initialization\nand proportional sampling is O(log L), when D = L, and all h_d are linear.\n\nAgain, this fast convergence requires a pseudo-task-level evaluation function h. The solution adopted\nin this paper is to have the model indicate its hypermodule preference directly through backpropagation,\nby learning a softmax distribution over modules at each location. Similar distributions over\nmodules have been learned in previous work [30, 37, 47]. In Algorithm 1, at a given time there are\n1 + λ active mapping functions {ψ^i}_{i=0}^{λ}. Through backpropagation, the modules {ψ^i(ℓ)}_{i=0}^{λ} for\neach location ℓ can compete by generalizing Eq. 5 to include a soft-merge operation:\n\nB_ℓ = Σ_{i=0}^{λ} ψ^i(ℓ) ×̄_1 z_ℓ · softmax(s_ℓ)_i,    (8)\n\nwhere s_ℓ ∈ ℝ^{λ+1} is a vector of weights that induces a probability distribution over hypermodules.\nThrough training, the learned probability of softmax(s_ℓ)_i is the model’s belief that ψ^i(ℓ) is the best\noption for location ℓ out of {ψ^i(ℓ)}_{i=0}^{λ}. Using this belief function, Algorithm 1 can optimize ψ while\nsimultaneously learning the model parameters. Each iteration, the algorithm trains the model via Eq. 8\nwith backpropagation for n_iter steps, and h(ψ^i_ℓ) returns Σ_{j=0}^{λ} softmax(s_ℓ)_j · I(ψ^j_ℓ = ψ^i_ℓ), accounting\nfor duplicates. In contrast to existing model-design methods, task performance does not guide search;\nthis avoids overfitting to the validation set over many generations. Validation performance is only\nused for early stopping. Pseudocode for the end-to-end algorithm, along with additional training\nconsiderations, is given in S.3. The algorithm is evaluated experimentally in the next section.\n\nFigure 2: Visualizing convergence. These images show the convergence of ψ on the synthetic\ndataset. Each color corresponds to a distinct hypermodule. The color shown at each location is the\nhypermodule currently in use for that task. After generation 59 the model remains at the optimal\nsolution indefinitely, demonstrating the efficient convergence of MUiR.\n\nThe theoretical scalability of the algorithm means it can be applied in settings where existing DMTL\nmodule assembly methods are infeasible. For instance, when learning the alignment with soft ordering\n[37] the module operations increase quadratically; sampling from the softmax instead [47] would\nrequire thousands of additional parameters per module location; learning the alignment with CTR [30]\nis infeasibly complex via Theorem 3.1. These limitations are highlighted in the fact that experiments\nwith existing approaches use at most 4 [37], 4 [30], and 10 [45] modules, i.e., orders of magnitude\nfewer than what is considered in this paper (e.g., more than 10K modules in Section 4.2).\n\n4 Experiments\n\nThis section evaluates the approach developed in Section 3. First, the dynamics of the approach are\nvalidated on a synthetic MTL benchmark. Second, the approach is applied to a scale-up problem of\nsharing across diverse architectures and modalities. 
See S.4 for additional experimental details.\n\n4.1 Validating framework dynamics on a synthetic dataset\n\nThis section considers an MTL problem where the ground truth alignment is known. The dataset\ncontains three groups of ten linear regression tasks with input dimension 20, but only 15 training\nsamples per task [23]. The ground truth parameter vectors for tasks within a group differ only by a\nscalar. Tasks cannot be solved without exploiting this regularity. Two versions of the problem were\nconsidered, one with Gaussian noise added to sample outputs, and one with no noise. As in previous\nwork, each task model is linear, consisting of a single weight vector ∈ ℝ^20. In the single-task (STL)\ncase, these vectors are trained independently. In the MTL case (MUiR), c = 1, and each task is\nreparameterized with a single hypermodule ∈ ℝ^{1×20×1}. So, Algorithm 1 is initialized with 30\nhypermodules, and should converge to using only three, i.e., one for each group. For comparison, a\nRandom search setup is included (i.e., replacing argmax in Algorithm 1 with a random choice), as well\nas an Oracle setup, in which ψ is fixed to the true group alignment. Unlike in previous work, five training\nsamples for each task were withheld as validation data, making the setup more difficult.\nMUiR quickly converges to the true underlying grouping in the noiseless case (Figure 2), and yields\noptimal test loss (Table 2). In the noisy case, MUiR results in a similar improvement over the\nbaselines. Since a linear model is optimal for this dataset, MUiR cannot improve over the best linear\nmethod, but it achieves comparable results, despite differences in the setup that make generalization\nmore difficult: withholding data for validation and absence of additional regularization. These results\nshow that the softmax evaluation function effectively determines the value of hypermodules at each\nlocation. The next section shows that the algorithm scales to more complex problems.\n\nMethod                 Clean          Noisy\nSTL [23]               -              0.97\nMTL-FEAT [3]           -              0.48\nDG-MTL [23]            -              0.42\nGO-MTL [28]            -              0.35\nSTL (ours)             1.35 ± 0.01    1.49 ± 0.01\nMUiR + Random          1.26 ± 0.04    4.67 ± 1.48\nMUiR + Oracle          0.77 ± 0.77    0.37 ± 0.00\nMUiR + Optimization    0.00 ± 0.00    0.38 ± 0.00\n\nTable 2: Synthetic results. MUiR achieves perfect test RMSE in the clean case, even outperforming\nthe Oracle, which can sometimes overfit. MUiR similarly outperforms baselines in the noisy case.\nSince a linear model is optimal for this dataset, MUiR cannot improve over the best linear method,\nbut it achieves comparable results despite differences in the setup that make it more difficult:\nwithholding data for validation and absence of additional regularization. Also, in contrast to the\nother methods, MUiR learns the number of groups automatically.\n\nModality  Architecture       Baseline  Intratask  W+S     W+D     S+D     W+S+D   L+S     L+D     L+S+D\nVision    WRN-40-1 (W)       8.48      8.50       8.69    9.20    -       9.02    -       -       -\nText      Stacked LSTM (S)   134.41    132.06     130.63  -       132.62  128.10  129.73  -       130.77\nDNA       DeepBind-256 (D)   0.1540    0.1466     -       0.1461  0.1469  0.1464  -       0.1469  0.1464\nVision    LeNet (L)          21.08     20.67      -       -       -       -       21.02   19.59   20.23\n\nTable 3: Cross-modal results. This table shows the performance of each architecture across a chain\nof comparisons. Baseline trains the underlying model; Intratask uses MUiR with a single task\narchitecture; the remaining setups indicate multiple architectures trained jointly with MUiR. Lower\nscores are better: classification error for vision, perplexity for text and MSE for DNA. For each\narchitecture, the top two setups are in bold. The LSTM, DeepBind, and LeNet models all benefit from\ncross-modal sharing; and in all 16 cases, MUiR improves their performance over Baseline. Although\nthe text and DNA models both benefit from sharing with WRN, the effect is not reciprocated. The\nfact that LeNet improves suggests that it is not a problem in transferring across modalities, but that\nWRN has an architecture that is easier to share from than to. Overall, the ability of MUiR to improve\nperformance, even in the intratask case, indicates that it can exploit pseudo-task regularities.\n\n4.2 Sharing across diverse architectures and modalities\n\nThis experiment applies MUiR in its intended setting: sharing across diverse architectures and\nmodalities. The hypermodules generate 16 × 16 linear maps, and have context size c = 4, as in\nprevious work on hypernetworks [15]. The joint model shares across a vision problem, an NLP\nproblem, and a genomics problem (see S.5 for additional dataset and architecture details).\nThe first task is CIFAR-10, the classic image classification benchmark of 60K images [26]. As\nin previous work on hypernetworks, WideResNet-40-1 (WRN) is the underlying model [15, 58],\nyielding 2268 blocks to parameterize with hypermodules. The second task is the WikiText-2 language\nmodeling benchmark with over 2M tokens [36]. The underlying model is the standard stacked LSTM\nmodel with two LSTM layers each with 256 units [59], yielding 4096 blocks. The third task is\nCRISPR binding prediction, where the goal is to predict the propensity of a CRISPR protein complex\nto bind to (and cut) unintended locations in the genome [21]. The dataset contains binding affinities\nfor over 30M base pairs. 
The underlying model, DeepBind-256, is from the DeepBind family of\n1D-convolutional models designed for protein binding problems [2, 60], yielding 6400 blocks.\n\n4.2.1 Performance comparison across (architecture,task) subsets\nFor each of these three task-architecture pairs, a chain of comparisons were run, with increasing\ngenerality: a Baseline that trained the original architecture; an Intratask setup that applied MUiR\noptimization within a single task model; cross-modal optimization for each pair of tasks; and a\ncross-modal run across all three tasks. The main result is that the text and genomics models always\nimprove when they are trained with MUiR, and improve the most when they are trained jointly with\nthe WRN model (Table 3). This result raises a key question: Does the (WRN,vision) pair behave\ndifferently because of WRN or because of vision? To answer this question, an additional set of\nexperiments were run using LeNet [29] as the vision model. This model does indeed always improve\nwith MUiR, and improves the most with cross-modal sharing (Table 3), while similarly improving\nthe text and genomics models. The improvements for all three tasks are signi\ufb01cant (S.4). Overall, the\nresults con\ufb01rm that MUiR can improve performance by sharing across diverse modalities. A likely\nreason that the bene\ufb01t of WRN is one-directional is that the modules in WRN are highly specialized\nto work together as a deep stack. They provide useful diversity in the search for general modules, but\nthey are hard to improve using such modules. This result is important because it both illustrates where\nthe power of MUiR is coming from (diversity) and identi\ufb01es a key challenge for future methods.\n\n4.2.2 Analysis of module sharing dynamics\nTo understand the discovery process of MUiR, Figure 3a shows the number of modules used exclu-\nsively by each subset of tasks over time in a W+D+S run. 
The relative size of each subset stabilizes as\n\n7\n\n\f(a)\n\n(b)\n\nFigure 3: (a) Module sharing over time. The number of modules shared exclusively by each subset of\ntasks is shown for a MUiR run. The differences across subsets show that MUiR optimizes alignment\nin an architecture-dependent way. For example, the number of modules used only by the WRN and\nLSTM models always stays small, and the number used only by the DeepBind model eventually\nshrinks to almost zero, suggesting that the genomics model plays a central role in sharing. As a\nside-bene\ufb01t of this optimization, the number of parameters in the model decreases (blue line). (b)\nLayer-level sharing. To measure sharing across pairs of layers, for each pair in an L+S+D run,\nthis heatmap shows how many times more likely pairs of pseudo-tasks from those layers are to use\nthe same module than they would by chance. Sharing is highly architecture-dependent, with the\n1D-convolutional model playing a central role between the 2D-convolutional and 1D-LSTM models.\n\n is optimized, and is consistent over independent runs, showing that MUiR shares in an architecture-\ndependent way. In particular, the number of modules used only by W and S models remains small,\nand the number used only by D shrinks to near zero, suggesting that the genomics model plays a\ncentral role in sharing. Analyzed at the layer level in the L+S+D setup, the bulk of sharing does\nindeed involve D (Figure 3b). D and L are both convolutional, while D and S process 1-dimensional\ninput, which may make it easier for L and S to share with D than directly with each other.\nA side-bene\ufb01t of MUiR is that the number of model parameters decreases over time (up to 20% in\nFigure 3a), which is helpful when models need to be small, e.g., on mobile devices. 
Such shrinkage\nis achieved when the optimized model has many modules that are used for many pseudo-tasks.\nHypermodules are considered generic if they are used more than c times in the joint model, and\nspecific otherwise. Similarly, pseudo-tasks, along with their contexts and generated linear maps, are\nconsidered generic if they use generic modules, and specific otherwise. Sets of generic and specific\ntensors were compared based on statistical properties of their learned parameters. The generic tensors\nhad significantly smaller average standard deviation, L2-norm, and max value (Table 4). Such a\ntighter distribution of parameters indicates greater generality [4, 27].\n\n4.2.3 Ablations and DMTL comparisons\n\nEven though their application seems unnatural for the cross-domain problem, experiments were\nperformed using existing DMTL methods: classical DMTL (e.g., [13, 19, 61]), in which aligned\nparameters are shared exactly across tasks; and parallel adapters [44], which is state-of-the-art\nfor vision MTL. Both of these methods require a hierarchical alignment of parameters across\narchitectures. Here, the most natural hierarchical alignment is used, based on a topological sort of\nthe block locations within each architecture: the ith location uses the ith parameter block. MUiR\noutperforms the existing methods (Table 6). Interestingly, the existing methods each outperform\nsingle task learning (STL) on two out of three tasks. This result shows the value of the universal\ndecomposition in Section 3.1, even when used with other DMTL approaches.\n\nParameter Group  Stdev   Max     Mean    Norm\nHypermodules     7e-4    6e-3    3e-1    8e-4\nContexts         1e-43   5e-126  1e-143  4e-138\nLinear Maps      3e-153  4e-146  5e-2    5e-153\n\nTable 4: Generic vs. specific modules. 
For a\nW+S+D run of MUiR, this table gives two-tailed\np-values (Mann-Whitney) comparing generic vs.\nspecific weight tensors over four statistics for\neach parameter group: modules, contexts, and\nthe linear maps they generate. The generic tensors\ntend to have a much tighter distribution of\nparameters, indicative of better generalization:\nThey must be applied in many situations with\nminimal disruption to overall network behavior.\n\nMethod                               LeNet  Stacked LSTM  DeepBind\nSingle Task Learning                 21.46  135.03        0.1543\nClassical DMTL (e.g., [13, 19, 61])  21.09  145.88        0.1519\nParallel Adapters [44]               21.05  132.02        0.1600\nMUiR + Hierarchical Init.            20.72  128.94        0.1465\nMUiR                                 20.51  130.70        0.1464\n\nTable 6: Comparison across DMTL methods. MUiR outperforms\nthe other methods, even with hierarchical initialization.\n\nc  LeNet  Stacked LSTM  DeepBind\n0  21.89  144.52        0.1508\n1  21.80  140.94        0.1477\n2  20.40  133.94        0.1504\n4  20.51  130.70        0.1464\n8  20.62  130.80        0.1468\n\nTable 7: Comparison across c. Hypermodules\n(c > 0) are beneficial.\n\nNext, the significance of the initialization method was tested, by initializing MUiR with the\nhierarchical alignment used by the other methods, instead of the disjoint initialization suggested\nby Theorem 3.2. This method (Table 6: MUiR+Hierarchical Init.) still outperforms the previous\nmethods on all tasks, but may be better or worse than MUiR for a given task. This result confirms the\nvalue of MUiR as a framework, and suggests that more sophisticated initialization could be useful.\nThe importance of hypermodule context size c was also tested. Comparisons were run with c = 0\n(blocks shared exactly), 1, 2, 4 (the default value), and 8. 
The results confirm that location-specific\ncontexts are critical to effective sharing, and that there is robustness to the value of c (Table 7).\nFinally, MUiR was tested when applied to a highly-tuned WikiText-2 baseline: AWD-LSTM [35].\nExperiments directly used the official AWD-LSTM training parameters, i.e., they are tuned to\nAWD-LSTM, not MUiR. MUiR parameters were exactly those used in the other cross-domain\nexperiments. MUiR achieves performance comparable to STL, while reducing the number of LSTM\nparameters from 19.8M to 8.8M during optimization (Table 5). In addition, MUiR outperforms STL with\nthe same number of parameters (i.e., with a reduced LSTM hidden size). These results show that\nMUiR supports efficient parameter sharing, even when dropped off-the-shelf into highly-tuned setups.\nHowever, MUiR does not improve the perplexity of the best AWD-LSTM model. The challenge is that\nthe key strengths of AWD-LSTM come from its sophisticated training scheme, not its architecture.\nMUiR has unified diverse architectures; future work must unify diverse training schemes.\n\nTable 5: Results on WikiText-2 with\nAWD-LSTM [35].\n\nMethod  LSTM Parameters  Perplexity\nSTL     8.8M             73.64\nMUiR    8.8M             71.01\nSTL     19.8M            69.94\n\n5 Discussion and Future Work\n\nGiven a set of deep learning problems defined by potentially disparate (architecture,task) pairs, MUiR\nshows that learned functionality can be effectively shared between them. As the first solution to this\nproblem, MUiR takes advantage of existing DMTL approaches, but it is possible to improve it with\nmore sophisticated and insightful methods in the future. Hypermodules are able to capture general\nfunctionality, but more involved factorizations could more easily exploit pseudo-task relationships\n[32, 57]. 
Similarly, the (1 + \u03bb)-EA is simple and amenable to analysis, but more sophisticated\noptimization schemes [10, 47, 54] may be critical in scaling to more open-ended settings. In\nparticular, the modularity of MUiR makes extensions to lifelong learning [1, 6, 43, 52] especially\npromising: It should be possible to collect and refine a compact set of modules that are assembled in\nnew ways to solve future tasks as they appear, seamlessly integrating new architectural methodologies.\nSuch functionality is fundamental to general problem solving, providing a foundation for integrating\nand extending knowledge across all behaviors during the lifetime of an intelligent agent.\n\n6 Conclusion\n\nTo go beyond methodological sharing in deep learning, this paper introduced an approach to learning\nsharable functionality from a diverse set of problems. Training a set of (architecture,task) pairs is\nviewed as solving a set of related pseudo-tasks, whose relatedness can be exploited by optimizing\na mapping between hypermodules and the pseudo-tasks they solve. By integrating knowledge in\na modular fashion across diverse domains, the approach establishes a key ingredient for general\nproblem solving systems in the future.\n\nAcknowledgments\n\nMany thanks to John Hawkins for introducing us to the CRISPR binding prediction problem and\nproviding the data set. Thanks also to the reviewers for suggesting comparisons across framework\ndesign choices and other DMTL methods.\n\nReferences\n\n[1] D. Abel, D. Arumugam, L. Lehnert, and M. Littman. State abstractions for lifelong reinforcement\nlearning. In Proc. of ICML, pages 10\u201319, 2018.\n\n[2] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey. Predicting the sequence specificities\nof DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831, 2015.\n\n[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. 
Machine Learning,\n73(3):243\u2013272, 2008.\n\n[4] P. L. Bartlett. For valid generalization the size of the weights is more important than the size of\nthe network. In NIPS, pages 134\u2013140, 1997.\n\n[5] H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks. In\nNIPS, pages 235\u2013243, 2016.\n\n[6] E. Brunskill and L. Li. Pac-inspired option discovery in lifelong reinforcement learning. In\nProc. of ICML, pages 316\u2013324, 2014.\n\n[7] R. Caruana. Multitask learning. In Learning to learn, pages 95\u2013133. Springer US, 1998.\n\n[8] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. Gradnorm: Gradient normalization\nfor adaptive loss balancing in deep multitask networks. In Proc. of ICML, 2018.\n\n[9] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural\nnetworks with multitask learning. In Proc. of ICML, pages 160\u2013167, 2008.\n\n[10] K. Deb and C. Myburgh. Breaking the billion-variable barrier in real-world optimization using\na customized evolutionary algorithm. In Proc. of GECCO, pages 653\u2013660, 2016.\n\n[11] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network\npolicies for multi-task and multi-robot transfer. In Proc. of ICRA, pages 2169\u20132176, 2017.\n\n[12] B. Doerr, T. Jansen, and C. Klein. Comparing global and local mutations on bit strings. In Proc.\nof GECCO, pages 929\u2013936, 2008.\n\n[13] D. Dong, H. Wu, W. He, D. Yu, and H. Wang. Multi-task learning for multiple language\ntranslation. In Proc. of ACL, pages 1723\u20131732, 2015.\n\n[14] B. Eisenberg. On the expectation of the maximum of iid geometric random variables. Statistics\n& Probability Letters, 78(2):135\u2013143, 2008.\n\n[15] D. Ha, A. M. Dai, and Q. V. Le. Hypernetworks. In Proc. of ICLR, 2017.\n\n[16] K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher. 
A joint many-task model: Growing a\nneural network for multiple NLP tasks. In Proc. of EMNLP, pages 1923\u20131933, 2017.\n\n[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. of\nCVPR, pages 770\u2013778, 2016.\n\n[18] J. T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong. Cross-language knowledge transfer using\nmultilingual deep neural network with shared hidden layers. In Proc. of ICASSP, pages\n7304\u20137308, 2013.\n\n[19] Z. Huang, J. Li, S. M. Siniscalchi, et al. Rapid adaptation for deep neural networks through\nmulti-task learning. In Proc. of Interspeech, pages 3625\u20133629, 2015.\n\n[20] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu.\nReinforcement learning with unsupervised auxiliary tasks. In Proc. of ICLR, 2017.\n\n[21] C. Jung, J. A. Hawkins, S. K. Jones, et al. Massively parallel biophysical analysis of CRISPR-Cas\ncomplexes on next generation sequencing chips. Cell, 170(1):35\u201347, 2017.\n\n[22] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One\nmodel to learn them all. CoRR, abs/1706.05137, 2017.\n\n[23] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning.\nIn Proc. of ICML, pages 521\u2013528, 2011.\n\n[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980,\n2014.\n\n[25] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review,\n51:455\u2013500, 2009.\n\n[26] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.\n\n[27] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In NIPS, pages\n950\u2013957, 1992.\n\n[28] A. Kumar and H. Daum\u00e9, III. Learning task grouping and overlap in multi-task learning. In\nProc. of ICML, pages 1723\u20131730, 2012.\n\n[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 
Gradient-based learning applied to document\nrecognition. Proc. of the IEEE, 86(11):2278\u20132324, 1998.\n\n[30] J. Liang, E. Meyerson, and R. Miikkulainen. Evolutionary architecture search for deep multitask\nnetworks. In Proc. of GECCO, 2018.\n\n[31] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Y. Wang. Representation learning using\nmulti-task deep neural networks for semantic classification and information retrieval. In Proc.\nof NAACL, pages 912\u2013921, 2015.\n\n[32] M. Long, Z. Cao, J. Wang, and P. S. Yu. Learning multiple tasks with multilinear relationship\nnetworks. In NIPS, pages 1593\u20131602, 2017.\n\n[33] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris. Fully-adaptive feature sharing in\nmulti-task networks with applications in person attribute classification. In Proc. of CVPR, 2017.\n\n[34] M. T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser. Multi-task sequence to sequence\nlearning. In Proc. of ICLR, 2016.\n\n[35] S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing LSTM language models.\nIn Proc. of ICLR, 2018.\n\n[36] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. CoRR,\nabs/1609.07843, 2016.\n\n[37] E. Meyerson and R. Miikkulainen. Beyond shared hierarchies: Deep multitask learning through\nsoft layer ordering. In Proc. of ICLR, 2018.\n\n[38] E. Meyerson and R. Miikkulainen. Pseudo-task augmentation: From deep multitask learning to\nintratask sharing\u2014and back. In Proc. of ICML, 2018.\n\n[39] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning.\nIn Proc. of CVPR, 2016.\n\n[40] F. Neumann and C. Witt. On the runtime of randomized local search and simple evolutionary\nalgorithms for dynamic makespan scheduling. In Proc. of IJCAI, pages 3742\u20133748, 2015.\n\n[41] A. Paszke et al. Automatic differentiation in PyTorch. 2017.\n\n[42] R. Ranjan, V. M. Patel, and R. Chellappa. 
Hyperface: A deep multi-task learning framework\nfor face detection, landmark localization, pose estimation, and gender recognition. CoRR,\nabs/1603.01249, 2016.\n\n[43] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters.\nIn NIPS, pages 506\u2013516, 2017.\n\n[44] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural\nnetworks. In Proc. of CVPR, pages 8119\u20138127, 2018.\n\n[45] C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear\nfunctions for multi-task learning. In Proc. of ICLR, 2018.\n\n[46] M. L. Seltzer and J. Droppo. Multi-task learning in deep neural networks for improved phoneme\nrecognition. In Proc. of ICASSP, pages 6965\u20136969, 2013.\n\n[47] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean.\nOutrageously large neural networks: The sparsely-gated mixture-of-experts layer. In Proc. of ICLR,\n2017.\n\n[48] K. O. Stanley, D. B. D\u2019Ambrosio, and J. Gauci. A hypercube-based encoding for evolving\nlarge-scale neural networks. Artificial Life, 15:185\u2013212, 2009.\n\n[49] D. Sudholt. On the robustness of evolutionary algorithms to noise: Refined results and an\nexample where noise helps. In Proc. of GECCO, pages 1523\u20131530, 2018.\n\n[50] R. S. Sutton and A. G. Barto. Introduction to reinforcement learning. MIT Press, 1998.\n\n[51] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu.\nDistral: Robust multitask reinforcement learning. In NIPS, pages 4499\u20134509, 2017.\n\n[52] S. Thrun and L. Pratt. Learning to Learn. 2012.\n\n[53] C. Witt. Tight bounds on the optimization time of a randomized search heuristic on linear\nfunctions. Combinatorics, Probability and Computing, 22(2):294\u2013318, 2013.\n\n[54] L. A. Wolsey and G. L. Nemhauser. Integer and combinatorial optimization. 
John Wiley &\nSons, 2014.\n\n[55] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King. Deep neural networks employing multi-task\nlearning and stacked bottleneck features for speech synthesis. In Proc. of ICASSP, pages\n4460\u20134464, 2015.\n\n[56] Y. Yang and T. Hospedales. A unified perspective on multi-domain and multi-task learning. In\nProc. of ICLR, 2015.\n\n[57] Y. Yang and T. Hospedales. Deep multi-task representation learning: A tensor factorisation\napproach. In Proc. of ICLR, 2017.\n\n[58] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.\n\n[59] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. CoRR,\nabs/1409.2329, 2014.\n\n[60] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford. Convolutional neural network architectures\nfor predicting DNA-protein binding. Bioinformatics, 32(12):i121\u2013i127, 2016.\n\n[61] Z. Zhang, L. Ping, L. C. Chen, and T. Xiaoou. Facial landmark detection by deep multi-task\nlearning. In Proc. of ECCV, pages 94\u2013108, 2014.\n", "award": [], "sourceid": 4315, "authors": [{"given_name": "Elliot", "family_name": "Meyerson", "institution": "Cognizant"}, {"given_name": "Risto", "family_name": "Miikkulainen", "institution": "The University of Texas at Austin; Cognizant"}]}