{"title": "A graph-theoretic approach to multitasking", "book": "Advances in Neural Information Processing Systems", "page_first": 2100, "page_last": 2109, "abstract": "A key feature of neural network architectures is their ability to support the simultaneous interaction among large numbers of units in the learning and processing of representations. However, how the richness of such interactions trades off against the ability of a network to simultaneously carry out multiple independent processes -- a salient limitation in many domains of human cognition -- remains largely unexplored. In this paper we use a graph-theoretic analysis of network architecture to address this question, where tasks are represented as edges in a bipartite graph $G=(A \\cup B, E)$. We define a new measure of multitasking capacity of such networks, based on the assumptions that tasks that \\emph{need} to be multitasked rely on independent resources, i.e., form a matching, and that tasks \\emph{can} be performed without interference if they form an induced matching. Our main result is an inherent tradeoff between the multitasking capacity and the average degree of the network that holds \\emph{regardless of the network architecture}. These results are also extended to networks of depth greater than $2$. On the positive side, we demonstrate that networks that are random-like (e.g., locally sparse) can have desirable multitasking properties. Our results shed light into the parallel-processing limitations of neural systems and provide insights that may be useful for the analysis and design of parallel architectures.", "full_text": "A graph-theoretic approach to multitasking\n\nNoga Alon\u2217\n\nTel-Aviv University\n\nDaniel Reichman\u2020\n\nUC Berkeley\n\nIgor Shinkar\u2217\nUC Berkeley\n\nTal Wagner\u2217\n\nMIT\n\nSebastian Musslick\nPrinceton University\n\nJonathan D. Cohen \u2021\nPrinceton University\n\nThomas L. 
Griffiths\n\nUC Berkeley\n\nBiswadip Dey\n\nPrinceton University\n\nKayhan Ozcimder\nPrinceton University\n\nAbstract\n\nA key feature of neural network architectures is their ability to support the simultaneous\ninteraction among large numbers of units in the learning and processing of\nrepresentations. However, how the richness of such interactions trades off against\nthe ability of a network to simultaneously carry out multiple independent processes\n\u2013 a salient limitation in many domains of human cognition \u2013 remains largely unexplored.\nIn this paper we use a graph-theoretic analysis of network architecture\nto address this question, where tasks are represented as edges in a bipartite graph\nG = (A \u222a B, E). We define a new measure of multitasking capacity of such\nnetworks, based on the assumptions that tasks that need to be multitasked rely on\nindependent resources, i.e., form a matching, and that tasks can be multitasked\nwithout interference if they form an induced matching. Our main result is an\ninherent tradeoff between the multitasking capacity and the average degree of the\nnetwork that holds regardless of the network architecture. These results are also\nextended to networks of depth greater than 2. On the positive side, we demonstrate\nthat networks that are random-like (e.g., locally sparse) can have desirable multitasking\nproperties. Our results shed light on the parallel-processing limitations of\nneural systems and provide insights that may be useful for the analysis and design\nof parallel architectures.\n\n1 Introduction\n\nOne of the primary features of neural network architectures is their ability to support parallel\ndistributed processing [RMG+86]. 
The decentralized nature of biological and artificial nets results in\ngreater robustness and fault tolerance when compared to serial architectures such as Turing machines.\nOn the other hand, the lack of a central coordination mechanism in neural networks can result\nin interference between units (neurons) and such interference effects have been demonstrated in\nseveral settings such as the analysis of associative memories [AGS85] and multitask learning [MC89].\n\n\u2217Equal contribution.\n\u2020Equal contribution. Supported by DARPA contract N66001-15-2-4048, Value Alignment in Autonomous\nSystems and Grant: 2014-1600, Sponsor: William and Flora Hewlett Foundation, Project Title: Cybersecurity\nand Internet Policy\n\u2021This publication was made possible through the support of a grant from the John Templeton Foundation.\nThe opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the\nJohn Templeton Foundation\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nUnderstanding the source of such interference and how it can be prevented has been a major focus of\nrecent research (see, e.g., [KPR+17] and the references therein).\nWhile the stark limitation of our ability to carry out multiple tasks simultaneously, i.e., multitask, is\none of the most widely documented phenomena in cognitive psychology [SS77], the sources of this\nlimitation are still unclear. Recently, a graph-theoretic model [FSGC14, MDO+16] has suggested\nthat interference effects may explain the limitations of the human cognitive system in performing\nmultiple task processes at the same time. This model consists of a simple 2-layer feed-forward\nnetwork represented by a bipartite graph G = (A \u222a B, E) wherein the vertex set is partitioned into\ntwo disjoint sets of nodes A and B, representing the inputs and the outputs of tasks respectively. 
An\nedge (a, b) \u2208 E corresponds to a directed pathway from the input layer to the output layer in the\nnetwork that is taken to represent a cognitive process (or task; see footnote 4) that maps an input to an output. In\nmore abstract terms, every vertex a \u2208 A is associated with a set of inputs I_a, every vertex b \u2208 B is\nassociated with a set of outputs O_b, and the edge (a, b) is associated with a function f_{a,b} : I_a \u2192 O_b (see footnote 5).\nIn this work, we also consider deeper architectures with r > 2 layers, where edges correspond\nto mappings between nodes from consecutive layers and a path P from the input (first) layer to the\noutput (last) layer corresponds to the composition of the mappings on the edges in P . The model above is\nquite general and simple modifications of it may apply to other settings. For example, we can assume\nthe vertices in A are senders and vertices in B are receivers and that a task associated with an edge\ne = (a, b) is transmitting information from a to b along a communication channel e.\nGiven a 2-layer network, a task set is a set of edges T \u2286 E. A key assumption made in [MDO+16,\nFSGC14] that we adopt as well is that all task sets that need to be multitasked in parallel form a\nmatching, namely, no two edges in T share a vertex as an endpoint. This assumption reflects a\nlimitation on the parallelism of the network that is similar to the Exclusive Read Exclusive Write\n(EREW) model in parallel RAM, where tasks cannot simultaneously read from the same input or\nwrite to the same output. Similarly, for depth r > 2 networks, task sets correspond to node-disjoint\npaths from the input layer to the output layer. 
For simplicity, from now on we shall mostly focus on the\ndepth 2 case with |A| = |B| = n.\nIn [MDO+16, FSGC14] it is suggested that concurrently executing two tasks associated with two\n(disjoint) edges e and f will result in interference if e and f are connected by a third edge h.\nThe rationale for this interference assumption stems from the distributed operation of the network\nthat may result in the task associated with h becoming activated automatically once its input and\noutput are operating, resulting in interference with the tasks associated with e and f. Therefore,\n[MDO+16, FSGC14] postulate that all tasks within a task set T can be performed in parallel\nwithout interference only if the edges in T form an induced matching. Namely, no two edges\nin T are connected by a third edge. Interestingly, the induced matching condition also arises in\nthe communication setting [BLM93, AMS12, CK85], where it is assumed that messages between\nsenders and receivers can be reliably transmitted if the edges connecting them form an induced\nmatching. Following the aforementioned interference model, [MDO+16] define the multitasking\ncapability of a bipartite network G as the maximum cardinality of an induced matching in G.\nIt has been demonstrated that neural network architectures are subject to a fundamental tradeoff\nbetween learning efficiency that is promoted by an economic use of shared representations between\ntasks, on the one hand, and the ability to execute multiple tasks independently, on the other hand\n[MS\u00d6+17]. Namely, it is suggested that as the average degree d (\u201cefficiency of representations\u201d\n\u2013 larger degree corresponds to more economical use of shared representations between tasks) of\nG increases, the \u201cmultitasking ability\u201d should decay in d [FSGC14]. That is, the cardinality of\na maximum induced matching should be upper bounded by f (d)n with lim_{d\u2192\u221e} f (d) = 0. 
This\nprediction was tested and supported on certain architectures by numerical simulations in [MDO+16,\nFSGC14], where it was suggested that environmental constraints push towards efficient use of\nrepresentations which inevitably limits multitasking.\n\nFootnote 4: We view a task as constituting a simple mechanistic instantiation of a cognitive process, consistent with\nNeisser\u2019s original definition [Nei67]. According to this definition a task process (e.g. color naming) is a mapping\nfrom an input space (e.g. colors) to an output space (verbal). Within this framework the decision of what\nconstitutes an input space for a task is left to the designer and may be problem-specific. The modeling of more\ncomplex tasks might require extending this framework to multidimensional input spaces. This would allow capturing\nscenarios in which tasks are partially overlapping in terms of their input and output spaces.\n\nFootnote 5: The function f_{a,b} is hypothesized to be implemented by a gate used in neural networks such as a sigmoid or\nthreshold gate.\n\nFigure 1: In the depicted bipartite graph, the node shading represents the bipartition. The blue edges\nform an induced matching, which represents a large set of tasks that can be multitasked. However,\nthe red edges form a matching in which the largest induced matching has size only 1. This represents\na set of tasks that greatly interfere with each other.\n\nFigure 2: Hypercube on 8 nodes. Node shading represents the bipartition. On the left, the blue edges\nform an induced matching of size 2. On the right, the red edges form a matching of size 4 whose\nlargest induced matching has size 1. Hence the multitasking capacity of the hypercube is at most 1/4.\n\nEstablishing such a tradeoff is of interest, as\nit can identify limitations of artificial nets that rely on shared representations and aid in designing\nsystems that attain an optimal tradeoff. 
More generally, establishing a connection between graph-theoretic\nparameters and connectionist models of cognition constitutes a new conceptual development\nthat may apply to domains beyond multitasking.\nIdentifying the multitasking capacity of G = (A \u222a B, E) with the size of its maximum induced\nmatching has two drawbacks. First, the existence of some, possibly large, set of tasks that can be\nmultitasked does not preclude the existence of a (possibly small) set of critical tasks that greatly\ninterfere with each other (e.g., consider the case in which a complete bipartite graph K_{d,d} occurs\nas a subgraph of G; this is illustrated in Figure 1). Second, it is easy to give examples of graphs\n(where |A| = |B| = n) with average degree \u2126(n) that contain an induced matching of size n/2\n(for example, two copies of a complete bipartite graph connected by a matching: see Figure 1 for\nan illustration). Hence, it is impossible to upper bound the multitasking capacity of every network\nwith average degree d by f (d)n with f vanishing as the average degree d tends to infinity. Therefore,\nthe generality of the suggested tradeoff between efficiency and concurrency is not clear under this\ndefinition.\nOur main contribution is a novel measure of the multitasking capacity that is aimed at solving the\nfirst problem, namely networks with \u201chigh\u201d capacity which contain a \u201csmall\u201d task set whose edges\nbadly interfere with one another. In particular, for a parameter k we ask whether every matching M\nof size k contains a large induced matching M\u2032 \u2286 M. This\nmotivates the following definition (see Figure 2 for an illustration).\nDefinition 1.1. Let G = (A \u222a B, E) be a bipartite graph with |A| = |B| = n, and let k \u2208 N be a\nparameter. 
We say that G is a (k, \u03b1(k))-multitasker if for every matching M in G of size |M| \u2264 k,\nthere exists an induced matching M\u2032 \u2286 M such that\n\n|M\u2032| \u2265 \u03b1(k)|M|.\n\nWe will say that a graph G is an \u03b1-multitasker if it is an (n, \u03b1)-multitasker.\nThe parameter \u03b1 \u2208 (0, 1] measures the multitasking capabilities of G, and the larger \u03b1 is, the better a\nmultitasker G is considered to be. We call the parameter \u03b1(k) \u2208 (0, 1] the multitasking capacity of G for\nmatchings of size k.\n\nDefinition 1.1 generalizes to networks of depth r > 2, where instead of matchings, we consider first\nlayer to last layer node-disjoint paths, and instead of induced matchings we consider induced paths,\ni.e., a set of disjoint paths such that no two nodes belonging to different paths are adjacent.\nThe main question we shall consider here is what kind of tradeoffs one should expect between \u03b1, d\nand k. In particular, which network architectures give rise to good multitasking behavior? Should we\nexpect \u201cmultitasking vs. multiplexing\u201d: namely, \u03b1 tending to zero with d for all graphs of average\ndegree d? While our definition of multitasking capacity is aimed at resolving the problem of small\ntask sets that can be poorly multitasked, it turns out to be related also to the \u201cmultitasking vs.\nmultiplexing\u201d phenomenon. Furthermore, our graph-theoretic formalism also gives insights as to how\nnetwork depth and interference are related.\n\n1.1 Our results\n\nWe divide the presentation of the results into two parts. The first part discusses the case of d-regular\ngraphs, and the second part discusses general graphs.\nThe d-regular case: Let G = (A \u222a B, E) be a bipartite d-regular graph with n vertices on each\nside. 
Considering the case of k = n, i.e., maximal possible induced matchings that are contained\nin a perfect matching (that is, a matching of cardinality n), we show that if a d-regular graph is an\n(n, \u03b1(n))-multitasker, then \u03b1(n) = O(1/\u221ad). Our upper bound on \u03b1(n) establishes an inherent\nlimitation on the multitasking capacity of any network. That is, for any infinite family of networks\nwith average degree tending to infinity it holds that \u03b1(n) must tend to 0 as the degree grows. In fact,\nwe prove that the degree d of the graph constrains the multitasking capacity also for task sets of smaller\nsizes. Specifically, for all k sufficiently larger than \u2126(n/d) it holds that \u03b1(k) tends to 0 as d\nincreases. In this version of the paper we prove this result for k > n/d^{1/4}. The full version of this\npaper [ACD+] contains the statement and the result that holds for all k > \u2126(n/d).\nTheorem 1.2. Let G = (A \u222a B, E) be a d-regular (k, \u03b1(k))-multitasker graph with |A| = |B| = n.\nIf n/d^{1/4} \u2264 k \u2264 n, then \u03b1(k) \u2264 O(n/(k\u221ad)). In particular, there exists a perfect matching in G that\ndoes not contain an induced matching of size larger than O(n/\u221ad).\n\nFor task sets of size n, Theorem 1.2 is tight up to logarithmic factors, as we provide a construction of\nan infinite family of d-regular graphs, where every matching of size n contains an induced matching\ncomprising an \u2126(1/(\u221ad log d)) fraction of its edges. The precise statement appears in the full version of the paper [ACD+].\nFor arbitrary values of k \u2264 n it is not hard to see that every d-regular graph achieves \u03b1(k) \u2265 1/(2d). We\nshow that this naive bound can be asymptotically improved upon, by constructing \u03b1-multitaskers\nwith \u03b1 = \u2126(log d / d). The construction is based on bipartite graphs which have good spectral expansion\nproperties. 
For more details see the full version of the paper [ACD+].\nWe also consider networks of depth r > 2 (see footnote 6). We generalize our ideas for depth 2 networks by upper-bounding\nthe multitasking capacity of arbitrary d-regular networks of depth r by O((r/(d ln(r)))^{1\u22121/r}).\nObserve that as we show that there are d-regular bipartite graphs with \u03b1(n) = 1/(\u221ad log d), this implies\nthat for task sets of size n, networks of depth 2 < r \u226a d incur interference which is strictly worse\nthan depth 2 networks. We believe that interference worsens as r increases to r + 1 (for r > 2),\nalthough whether this is indeed the case is an open question.\n\nThe irregular case: Next we turn to arbitrary, not necessarily regular, graphs. We show that\nfor an arbitrary bipartite graph with n vertices on each side and average degree d its multitasking\ncapacity \u03b1(n) is upper bounded by O((log n / d)^{1/3}). That is, as far as the average degree is concerned,\nthe multitasking capacity of a graph tends to zero, provided that the average degree of the graph is\n\u03c9(log n).\nTheorem 1.3. Let G = (A \u222a B, E) be a bipartite graph of average degree d with |A| = |B| = n.\nIf G is an \u03b1-multitasker then \u03b1 \u2264 O((log n / d)^{1/3}).\nFor dense graphs satisfying d = \u2126(n) (which are studied in [FSGC14]), we prove a stronger upper\nbound of \u03b1(n) = O(1/\u221an) using the Szemer\u00e9di regularity lemma. See Theorem 3.9 for details.\nWe also show that there are multitaskers of average degree \u2126(log log n), with \u03b1 > 1/3 \u2212 \u03b5. Hence,\nin contrast to the regular case, for the multitasking capacity to decay with average degree d, we must\nassume that d grows faster than log log n. The details behind this construction, which build on ideas\nin [Pyb85, PRS95], appear in the full version of this paper [ACD+].\n\nFootnote 6: We think of r as a constant independent of n and d as tending to infinity with n.\n\nFinally, for any d \u2208 N and for all \u03b1 \u2208 (0, 1/5) we show a construction of a graph with average degree\nd that is a (k, \u03b1)-multitasker for all k \u2264 \u2126(n/d^{1+4\u03b1}). Comparing this to the foregoing results, here\nwe do not require that d = O(log log n). That is, allowing larger values of d allows us to construct\nnetworks with constant multitasking capacities, albeit only with respect to matchings whose size is at\nmost n/d^{1+4\u03b1}. See Theorem 3.10 for details.\n\n2 Preliminaries\nA matching M in a graph G is a set of edges {e1, ..., em} such that no two edges in M share a\ncommon vertex. If G has 2n vertices and |M| = n, we say that M is a perfect matching. By Hall\u2019s\ntheorem, every d-regular bipartite graph with bipartition (A, B) has a perfect matching. A matching M is\ninduced if there are no two distinct edges e1, e2 in M, such that there is an edge connecting e1 to\ne2. Given a graph G = (V, E) and two disjoint sets A, B \u2286 V we let e(A, B) be the set of edges\nwith one endpoint in A and the other in B. For a subset A, e(A) is the set of all edges contained in A.\nGiven an edge e \u2208 E, we define the graph G/e obtained by contracting e = (u, v) as the graph with\nvertex set (V \u222a {v_e}) \\ {u, v}. The vertex v_e is connected to all vertices in G neighboring u or v. All\nother vertices x, y \u2208 V \\ {u, v} form an edge in G/e if and only if they were connected in\nG. Contracting a set of edges, and in particular contracting a matching, means contracting the edges\none by one in an arbitrary order.\nGiven a subset of vertices U \u2286 V , the subgraph induced by U, denoted by G[U ], is the graph whose\nvertex set is U and two vertices in U are connected if and only if they are connected in G. 
For a set\nof edges E\u2032 \u2286 E, denote by G[E\u2032] the graph induced by all vertices incident to an edge in E\u2032. We\nwill use the following simple observation throughout the paper.\nLemma 2.1. Let M be a matching in G, and let d_avg be the average degree of G[M ]. If we contract\nall edges in M in G[M ], then the resulting graph G\u0303[M ] has average degree at most 2d_avg \u2212 2.\nProof. G[M ] contains 2|M| vertices and d_avg|M| edges. The result follows as G\u0303[M ] has |M|\nvertices and at most d_avg|M| \u2212 |M| edges.\n\nAn independent set in a graph G = (V, E) is a set of vertices that do not span an edge. We will use\nthe following well known fact attributed to Tur\u00e1n.\nLemma 2.2. Every n-vertex graph with average degree d_avg contains an independent set of size at\nleast n/(d_avg + 1).\nLet G = (V, E) be a bipartite graph, k an integer and \u03b1 \u2208 (0, 1] a parameter. We define the\n(\u03b1, k)-matching graph H(G, \u03b1, k) = (L, R, F ) to be a bipartite graph, where L is the set of all\nmatchings of size k in G, R is the set of all induced matchings of size \u03b1k in G, and a vertex v_M \u2208 L\n(corresponding to matching M of size k) is connected to a vertex u_{M\u2032} (corresponding to an induced\nmatching M\u2032 of size \u03b1k) if and only if M\u2032 \u2286 M. We omit \u03b1, k, G from the notation of H when it\nis clear from the context. We will repeatedly use the following lemma in upper bounding the\nmultitasking capacity in graph families.\nLemma 2.3. Suppose that the average degree of the vertices in L in the graph H(G, \u03b1, k) is strictly\nsmaller than 1. Then \u03b1(k) < \u03b1.\n\nProof. By the assumption, L has a vertex of degree 0. 
Hence there exists a matching of size k in G\nthat does not contain an induced matching of size \u03b1k.\n\n3 Upper bounds on the multitasking capacity\n\n3.1 The regular case\n\nIn this section we prove Theorem 1.2 that upper bounds the multitasking capacity of arbitrary d-regular\nmultitaskers. We start the proof of Theorem 1.2 with the case k = n. The following theorem\nshows that d-regular (k = n, \u03b1)-multitaskers must have \u03b1 = O(1/\u221ad).\n\nTheorem 3.1. Let G = (A \u222a B, E) be a bipartite d-regular graph where |A| = |B| = n. Then G\ncontains a perfect matching M such that every induced matching M\u2032 \u2286 M has size at most 9n/\u221ad.\n\nFor the proof, we need bounds on the number of perfect matchings in d-regular bipartite graphs.\nLemma 3.2. Let G = (A, B, E) be a bipartite d-regular graph where |A| = |B| = n. Denote by\nM (G) the number of perfect matchings in G. Then\n\n(d/e)^n \u2264 ((d \u2212 1)^{d\u22121}/d^{d\u22122})^n \u2264 M (G) \u2264 (d!)^{n/d}.\n\nThe lower bound on M (G) is due to Schrijver [Sch98]. The upper bound on M (G) is known as\nMinc\u2019s conjecture, which has been proven by Bregman [Bre73].\nProof of Theorem 3.1. Consider H(G, \u03b1, n), where \u03b1 will be determined later. Clearly |R| \u2264\nbinom(n, \u03b1n)^2 \u2264 (e/\u03b1)^{2\u03b1n}, where binom(n, \u03b1n) denotes the binomial coefficient. By the upper bound in Lemma 3.2, every induced matching of size \u03b1n can be\ncontained in at most (d!)^{(1\u2212\u03b1)n/d} perfect matchings. By the lower bound in Lemma 3.2, |L| \u2265 (d/e)^n.\nTherefore, the average degree of the vertices in L is at most\n\n(e/\u03b1)^{2\u03b1n} \u00b7 (d!)^{(1\u2212\u03b1)n/d} / (d/e)^n \u2264 (e/\u03b1)^{2\u03b1n} \u00b7 (\u221a(2\u03c0d) (d/e)^d)^{(1\u2212\u03b1)n/d} / (d/e)^n = ((e^3/(\u03b1^2 d)) \u00b7 (2\u03c0d)^{(1\u2212\u03b1)/(2\u03b1d)})^{\u03b1n}.\n\nSetting \u03b1 > 2\u221a(e^3/d) yields e^3/(\u03b1^2 d) < 1/2, and it can be verified that (2\u03c0d)^{(1\u2212\u03b1)/(2\u03b1d)} < 2 for all such \u03b1.\nTherefore in this setting, the average degree of the vertices in L is smaller than 1, which concludes\nthe proof by Lemma 2.3.\n\nWe record the following simple observation, which is immediate from the definition.\nProposition 3.3. If G is a (k, \u03b1)-multitasker, then for all 1 < \u03b2 \u2264 n/k, the graph G is a (\u03b2k, \u03b1/\u03b2)-multitasker.\n\nTheorem 1.2 follows by combining Theorem 3.1 with (the contrapositive of) Proposition 3.3.\n\n3.2 Upper bounds for networks of depth larger than 2\n\nA graph G = (V, E) is a network with r layers of width n and degree d, if V is partitioned into r\nindependent sets V1, . . . , Vr of size n each, such that each (Vi, Vi+1) induces a d-regular bipartite\ngraph for all i < r, and there are no additional edges in G.\nA top-bottom path in G is a path v1, . . . , vr such that vi \u2208 Vi for all i \u2264 r, and vi, vi+1 are neighbors\nfor all i < r.\nA set of node-disjoint top-bottom paths p1, . . . , pk is called induced if for every two edges e \u2208 pi\nand e\u2032 \u2208 pj such that i \u2260 j, there is no edge in G connecting e and e\u2032.\nFact 3.4. A set of node-disjoint top-bottom paths p1, . . . , pk is induced if and only if for every i < r\nit holds that (p1 \u222a . . . 
\u222a pk) \u2229 E(Vi, Vi+1) is an induced matching in G.\nWe say that a network G as above is a (k, \u03b1)-multitasker if every set of k node-disjoint top-bottom\npaths contains an induced subset of size at least \u03b1k.\n\nTheorem 3.5. If G is an (n, \u03b1)-multitasker then \u03b1 < e \u00b7 (e\u00b7r/(d ln(r)))^{1\u22121/r}.\n\nProof. Let H = (L, R; EH ) be the bipartite graph in which side L has a node for each set of n\nnode-disjoint top-bottom paths in G, side R has a node for each induced set of \u03b1n node-disjoint\ntop-bottom paths in G, and P \u2208 L, P\u2032 \u2208 R are adjacent iff P\u2032 \u2282 P . Let D be the maximum\ndegree of side R. We wish to upper-bound the average degree of side L, which is upper-bounded by\nD|R|/|L|.\n\n|R| is clearly upper bounded by binom(n, \u03b1n)^r, where binom(n, \u03b1n) denotes the binomial coefficient. It is a simple observation that |L| equals \u220f_{i<r} m_i, where\nm_i denotes the number of perfect matchings in the bipartite graph G[Vi \u222a Vi+1]. Since this graph is\nd-regular, by the Falikman-Egorychev proof of the Van der Waerden conjecture ([Fal81], [Ego81]), or\nby Schrijver\u2019s lower bound, we have m_i \u2265 (d/e)^n and hence |L| \u2265 (d/e)^{n(r\u22121)}. To upper bound\nD, fix P\u2032 \u2208 R, and let G\u2032 be the network resulting from removing all nodes and edges in P\u2032 from G.\nThis removes exactly \u03b1n nodes from each layer Vi; denote by V\u2032_i the remaining nodes in this layer in\nG\u2032. It is a straightforward observation that D equals the number of sets of (1 \u2212 \u03b1)n node-disjoint\ntop-bottom paths in G\u2032. Each such set decomposes into M1, . . . , M_{r\u22121} such that Mi is a perfect\nmatching on G\u2032[V\u2032_i, V\u2032_{i+1}] for each i < r. Therefore D \u2264 \u220f_{i<r} m\u2032_i, where m\u2032_i denotes the number of\nperfect matchings in G\u2032[V\u2032_i, V\u2032_{i+1}]. The latter is a bipartite graph with (1\u2212\u03b1)n nodes on each side and\nmaximum degree d, and hence by the Bregman-Minc inequality, m\u2032_i \u2264 (d!)^{(1\u2212\u03b1)n/d}. Consequently,\nD \u2264 (d!)^{(1\u2212\u03b1)n(r\u22121)/d}.\nPutting everything together, we find that the average degree of side L is upper bounded by\n\nD|R|/|L| \u2264 (d!)^{(1\u2212\u03b1)n(r\u22121)/d} \u00b7 binom(n, \u03b1n)^r / (d/e)^{n(r\u22121)} \u2264 (\u221a(2\u03c0d) (d/e)^d)^{(1\u2212\u03b1)n(r\u22121)/d} \u00b7 (e/\u03b1)^{\u03b1nr} / (d/e)^{n(r\u22121)} = ((2\u03c0d)^{(1\u2212\u03b1)/(2\u03b1d)} \u00b7 (e/d) \u00b7 (e/\u03b1)^{r/(r\u22121)})^{\u03b1n(r\u22121)}.   (1)\n\nFor C = r/ ln(r) we will show that if \u03b1 \u2265 e(eC/d)^{1\u22121/r} then the above bound is less than 1, which\nimplies side L has a node of degree 0, a contradiction. To this end, note that for this \u03b1 we have\n\n(e/d) \u00b7 (e/\u03b1)^{r/(r\u22121)} \u2264 1/C = ln(r)/r,   (2)\n\nand\n\n(2\u03c0d)^{(1\u2212\u03b1)/(2\u03b1d)} \u2264 (2\u03c0d)^{1/(2\u03b1d)} \u2264 (2\u03c0d)^{1/(2eC^{1\u22121/r} d^{1/r})}.\n\nFact 3.6. For all constants \u03b3, \u03b2 > 0, the function f (d) = (\u03b3d)^{1/(\u03b2d^{1/r})} is maximized at d = e^r/\u03b3,\nand f (e^r/\u03b3) = e^{r\u03b3^{1/r}/(\u03b2e)}.\nPlugging this above (and using r \u2265 2), we obtain\n(2\u03c0d)^{(1\u2212\u03b1)/(2\u03b1d)} \u2264 (2\u03c0d)^{1/(2eC^{1\u22121/r} d^{1/r})} \u2264 e^{r(2\u03c0eC)^{1/r}/(2Ce^2)} \u2264 e^{ln(r)\u00b7\u221a(2\u03c0)\u00b7r^{1/r}/(2e^{3/2})} \u2264 \u221ar,\nand plugging this with Equation (2) into Equation (1) yields D|R|/|L| < 1, as required.\n\n3.3 The irregular case\n\nBelow we consider general (not necessarily regular) graphs with average degree d, and prove\nTheorem 1.3. In order to prove it, we first show a limitation on the multitasking capacity of graphs\nwhere the average degree of a graph is d, and the maximum degree is bounded by a parameter \u2206.\nTheorem 3.7. 
Let G be a bipartite graph with n nodes on each side, average degree d, and maximum\ndegree \u2206. If G is an \u03b1-multitasker, then \u03b1 < O(\u2206^{1/3}/d^{2/3}).\n\nA proof of Theorem 3.7 can be found in the full version of this paper [ACD+].\nNote that Theorem 3.7 does not provide any nontrivial bounds on \u03b1 when \u2206 exceeds d^2. However, we\nuse it to prove Theorem 1.3, which establishes nearly the same upper bound with no assumption on \u2206.\nTo do so we need the following lemma, which is also proved in the full version of this paper [ACD+].\nLemma 3.8. Every bipartite graph with 2n vertices and average degree d > 4 log n contains a\nsubgraph in which the average degree is at least b = d/(4 log n) and the maximum degree is at most 2b.\n\nWe can now prove Theorem 1.3.\nProof of Theorem 1.3. By Lemma 3.8 G contains a subgraph with average degree b \u2265 d/(4 log n)\nand maximum degree at most 2b. The result thus follows from Theorem 3.7.\n\nAs in the regular case, for smaller values of k we can obtain a bound of \u03b1 = O(\u221a(n/(dk))) for (k, \u03b1)-multitaskers.\nSee the full version of this paper [ACD+] for the precise details.\nWhen the graph is dense, we prove the following better upper bound on \u03b1.\n\nTheorem 3.9. Let G be a bipartite graph with n vertices on each side, and average degree d = \u2126(n).\nIf G is an \u03b1-multitasker, then \u03b1 < O((1/n)^{1/2}).\nProof. By the result in [PRS95] (see Theorem 3) the graph G contains a d\u2032-regular bipartite subgraph\nwith d\u2032 = \u2126(n). The result thus follows from our upper bound for regular graphs as stated in\nTheorem 1.2.\n\n3.4 A simple construction of a good multitasker\n\nWe show that for small constants \u03b1, we may achieve a significant increase in k: we show the existence of\n(O(n/d^{1+4\u03b1}), \u03b1)-multitaskers for any 0 < \u03b1 < 1/5.\nTheorem 3.10. Fix d \u2208 N, and let n \u2208 N be sufficiently large. 
For a fixed 0 < \u03b1 < 1/5, there exists\na (k, \u03b1)-multitasker with n vertices on each side, average degree d, for all k \u2264 \u2126(n/d^{1+4\u03b1}).\n\nProof. It is known (see, e.g., [FW16]) that for sufficiently large n, there exists an n-vertex graph\nG = (V, E) with average degree d such that every subgraph of G of size s \u2264 O(n/d^{1+4\u03b1}) has\naverage degree at most (1/2)(1/\u03b1 \u2212 1). Define a bipartite graph H = (A \u222a B, EH ) such that A and B are\ntwo copies of V , and for a \u2208 A and b \u2208 B we have (a, b) \u2208 EH if and only if (a, b) \u2208 E. We get\nthat the average degree of H is d, and for any two A\u2032 \u2286 A and B\u2032 \u2286 B such that |A\u2032| = |B\u2032| \u2264 s/2,\nthe average degree of H[A\u2032 \u222a B\u2032] is at most 1/\u03b1 \u2212 1. Consider a matching M of size s/2 in H. By\nLemma 2.1, if we contract all edges of the matching, we get a graph of average degree at most 2/\u03b1 \u2212 1.\nBy Lemma 2.2, such a graph contains an independent set of size at least (1/2)\u03b1|M|, which corresponds\nto a large induced matching contained in M. This concludes the proof of the theorem.\n\n4 Conclusions\n\nWe have considered a new multitasking measure for parallel architectures that is aimed at providing\nquantitative measures of parallel processing capabilities of neural systems. We established an inherent\ntradeoff between the density of the network and its multitasking capacity that holds for every graph\nthat is sufficiently dense. This tradeoff is rather general and it applies to regular graphs, to irregular\ngraphs and to layered networks of depth greater than 2. We have also obtained quantitative insights.\nFor example, we have provided evidence that interference increases as depth increases from 2 to\nr > 2, and demonstrated that irregular graphs allow for better multitasking than regular graphs for\ncertain edge densities. 
Our findings are also related to recent efforts in cognitive neuroscience to\npinpoint the reason for the limitations people experience in multitasking control-demanding tasks.\nWe have found that networks with pseudorandom properties (locally sparse, spectral expanders) have\ngood multitasking capabilities. Interestingly, previous works have documented the benefits of random\nand pseudorandom architectures in deep learning, Hopfield networks and other settings [ABGM14,\nVal00, KP88]. Whether there is an underlying cause for these results remains an interesting direction\nfor future research.\nOur work is limited in several aspects. First, our model is graph-theoretic in nature: it focuses\nexclusively on the adjacency structure of tasks and does not consider many parameters that emerge in\nbiological and artificial parallel architectures. Second, we do not address tasks of different weights\n(assuming all tasks have the same weights), stochastic and probabilistic interference (we assume\ninterference occurs with probability 1) and the exact implementation of the functions that compute\nthe tasks represented by edges. A promising avenue for future work will be to evaluate the predictive\nvalidity of \u03b1, that is, the ability to predict parallel processing performance of trained neural networks\nfrom corresponding measures of \u03b1.\nTo summarize, the current work is directed towards laying the foundations for a deeper understanding\nof the factors that affect the tension between efficiency of representation and flexibility of processing\nin neural network architectures. We hope that this will help inspire a parallel proliferation of efforts\nto further explore this area.\n\nReferences\n[ABGM14] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for\nlearning some deep representations. In ICML, pages 584\u2013592, 2014.\n\n[ACD+] Noga Alon, Jonathan D. 
Cohen, Biswadip Dey, Tom Griffiths, Sebastian Musslick,\nKayhan \u00d6zcimder, Daniel Reichman, Igor Shinkar, and Tal Wagner. A graph-theoretic\napproach to multitasking (full version). Available at arXiv:1611.02400, 2017.\n\n[AGS85] Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Storing infinite numbers of\npatterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530,\n1985.\n\n[AMS12] Noga Alon, Ankur Moitra, and Benny Sudakov. Nearly complete graphs decomposable\ninto large induced matchings and their applications. In Proceedings of the Forty-Fourth\nAnnual ACM Symposium on Theory of Computing, pages 1079\u20131090, 2012.\n\n[BLM93] Yitzhak Birk, Nathan Linial, and Roy Meshulam. On the uniform-traffic capacity\nof single-hop interconnections employing shared directional multichannels. IEEE\nTransactions on Information Theory, 39(1):186\u2013191, 1993.\n\n[Bre73] Lev M Bregman. Some properties of nonnegative matrices and their permanents. In\nSoviet Math. Dokl, volume 14, pages 945\u2013949, 1973.\n\n[CK85] Imrich Chlamtac and Shay Kutten. On broadcasting in radio networks\u2013problem analysis\nand protocol design. IEEE Transactions on Communications, 33(12):1240\u20131246, 1985.\n\n[Ego81] Gregory P. Egorychev. The solution of van der Waerden\u2019s problem for permanents.\nAdvances in Mathematics, 42(3):299\u2013305, 1981.\n\n[Fal81] Dmitry I Falikman. Proof of the van der Waerden conjecture regarding the permanent of\na doubly stochastic matrix. Mathematical Notes, 29(6):475\u2013479, 1981.\n\n[FSGC14] Samuel F Feng, Michael Schwemmer, Samuel J Gershman, and Jonathan D Cohen.\nMultitasking versus multiplexing: Toward a normative account of limitations in the\nsimultaneous execution of control-demanding behaviors. Cognitive, Affective, & Behavioral\nNeuroscience, 14(1):129\u2013146, 2014.\n\n[FW16] Uriel Feige and Tal Wagner. 
Generalized girth problems in graphs and hypergraphs.\nManuscript, 2016.\n\n[KP88] J\u00e1nos Koml\u00f3s and Ramamohan Paturi. Convergence results in an associative memory\nmodel. Neural Networks, 1(3):239\u2013250, 1988.\n\n[KPR+17] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,\nAndrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska,\net al. Overcoming catastrophic forgetting in neural networks. Proceedings\nof the National Academy of Sciences, pages 3521\u20133526, 2017.\n\n[MC89] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist\nnetworks: The sequential learning problem. Psychology of Learning and Motivation,\n24:109\u2013165, 1989.\n\n[MDO+16] Sebastian Musslick, Biswadip Dey, Kayhan Ozcimder, Mostofa Patwary, Ted L Willke,\nand Jonathan D Cohen. Controlled vs. Automatic Processing: A Graph-Theoretic Approach\nto the Analysis of Serial vs. Parallel Processing in Neural Network Architectures.\nIn Proceedings of the 38th Annual Meeting of the Cognitive Science Society (CogSci),\npages 1547\u20131552, 2016.\n\n[MS\u00d6+17] Sebastian Musslick, Andrew Saxe, Kayhan \u00d6zcimder, Biswadip Dey, Greg Henselman,\nand Jonathan D. Cohen. Multitasking capability versus learning efficiency in neural\nnetwork architectures. In 39th Cognitive Science Society Conference, London, 2017.\n\n[Nei67] Ulrich Neisser. Cognitive psychology. Appleton-Century-Crofts, New York, 1967.\n\n[PRS95] L\u00e1szl\u00f3 Pyber, Vojtech R\u00f6dl, and Endre Szemer\u00e9di. Dense graphs without 3-regular\nsubgraphs. Journal of Combinatorial Theory, Series B, 63(1):41\u201354, 1995.\n\n[Pyb85] Laszlo Pyber. Regular subgraphs of dense graphs. Combinatorica, 5(4):347\u2013349, 1985.\n\n[RMG+86] David E Rumelhart, James L McClelland, PDP Research Group, et al. Parallel distributed\nprocessing: Explorations in the microstructure of cognition, vol. 1-2. 
MIT Press, MA,\n1986.\n\n[Sch98] Alexander Schrijver. Counting 1-factors in regular bipartite graphs. Journal of Combinatorial\nTheory, Series B, 72(1):122\u2013135, 1998.\n\n[SS77] Walter Schneider and Richard M Shiffrin. Controlled and automatic human information\nprocessing: I. Detection, search, and attention. Psychological Review, 84(1):1\u201366, 1977.\n\n[Val00] Leslie G Valiant. Circuits of the Mind. Oxford University Press, 2000.", "award": [], "sourceid": 1270, "authors": [{"given_name": "Noga", "family_name": "Alon", "institution": "Tel Aviv University"}, {"given_name": "Daniel", "family_name": "Reichman", "institution": "University of California, Berkeley"}, {"given_name": "Igor", "family_name": "Shinkar", "institution": "UC Berkeley"}, {"given_name": "Tal", "family_name": "Wagner", "institution": "MIT"}, {"given_name": "Sebastian", "family_name": "Musslick", "institution": null}, {"given_name": "Jonathan", "family_name": "Cohen", "institution": "Princeton University"}, {"given_name": "Tom", "family_name": "Griffiths", "institution": "UC Berkeley"}, {"given_name": "Biswadip", "family_name": "dey", "institution": "Princeton University"}, {"given_name": "Kayhan", "family_name": "Ozcimder", "institution": "Princeton University"}]}