{"title": "Backpropagation with Callbacks: Foundations for Efficient and Expressive Differentiable Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 10180, "page_last": 10191, "abstract": "Training of deep learning models depends on gradient descent and end-to-end\ndifferentiation. Under the slogan of differentiable programming, there is an\nincreasing demand for efficient automatic gradient computation for emerging\nnetwork architectures that incorporate dynamic control flow, especially in NLP.\n\nIn this paper we propose an implementation of backpropagation using functions\nwith callbacks, where the forward pass is executed as a sequence of function\ncalls, and the backward pass as a corresponding sequence of function returns.\nA key realization is that this technique of chaining callbacks is well known in the\nprogramming languages community as continuation-passing style (CPS). Any\nprogram can be converted to this form using standard techniques, and hence,\nany program can be mechanically converted to compute gradients.\n\nOur approach achieves the same flexibility as other reverse-mode automatic\ndifferentiation (AD) techniques, but it can be implemented without any auxiliary\ndata structures besides the function call stack, and it can easily be combined\nwith graph construction and native code generation techniques through forms of\nmulti-stage programming, leading to a highly efficient implementation that\ncombines the performance benefits of define-then-run software frameworks such\nas TensorFlow with the expressiveness of define-by-run frameworks such as PyTorch.", "full_text": "Backpropagation with Continuation Callbacks:\n\nFoundations for Ef\ufb01cient and Expressive\n\nDifferentiable Programming\n\nFei Wang\n\nPurdue University\n\nWest Lafayette, IN 47906\n\nwang603@purdue.edu\n\nJames Decker\n\nPurdue University\n\nWest Lafayette, IN 47906\ndecker31@purdue.edu\n\nXilun Wu\n\nPurdue University\n\nWest Lafayette, 
IN 47906\n\nGr\u00e9gory Essertel\nPurdue University\n\nWest Lafayette, IN, 47906\n\nTiark Rompf\n\nPurdue University\n\nWest Lafayette, IN, 47906\n\nwu636@purdue.edu\n\ngesserte@purdue.edu\n\ntiark@purdue.edu\n\nAbstract\n\nTraining of deep learning models depends on gradient descent and end-to-end\ndifferentiation. Under the slogan of differentiable programming, there is an increas-\ning demand for ef\ufb01cient automatic gradient computation for emerging network\narchitectures that incorporate dynamic control \ufb02ow, especially in NLP.\nIn this paper we propose an implementation of backpropagation using functions\nwith callbacks, where the forward pass is executed as a sequence of function\ncalls, and the backward pass as a corresponding sequence of function returns. A\nkey realization is that this technique of chaining callbacks is well known in the\nprogramming languages community as continuation-passing style (CPS). Any\nprogram can be converted to this form using standard techniques, and hence, any\nprogram can be mechanically converted to compute gradients.\nOur approach achieves the same \ufb02exibility as other reverse-mode automatic differ-\nentiation (AD) techniques, but it can be implemented without any auxiliary data\nstructures besides the function call stack, and it can easily be combined with graph\nconstruction and native code generation techniques through forms of multi-stage\nprogramming, leading to a highly ef\ufb01cient implementation that combines the per-\nformance bene\ufb01ts of de\ufb01ne-then-run software frameworks such as TensorFlow\nwith the expressiveness of de\ufb01ne-by-run frameworks such as PyTorch.\n\n1\n\nIntroduction\n\nDifferentiable programming (Olah, 2015; LeCun, 2018) refers to a programming model where\nneural networks are truly functional blocks with data-dependent branches and recursion, while\nat the same time being trainable with backpropagation and gradient descent (Rumelhart et al.,\n1986). 
A programming model of such generality requires both expressivity and ef\ufb01ciency from the\nbackpropagation framework. However, the current generation of tools such as TensorFlow (Abadi\net al., 2015), and PyTorch (Paszke et al., 2017) trade off one for the other.\nInspired by the pattern of forward and backward passes, this paper proposes an implementation of\nbackpropagation using functions with callbacks. Each elementary operation becomes a function\ncall. The forward computation for this operation is performed on function entry, and the backward\ncomputation on function exit. In between, the result of the forward computation is passed to a\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fcallback, which executes the downstream (forward and backward) computations (Figure 1). The use\nof callbacks provides modularity and enables programmers to chain arbitrary operations together.\nWhile programming in this style with explicit callbacks is of course cumbersome, a key realization is\nthat this programming pattern has been well known in the programming languages community for\nmore than 50 years under the name continuation-passing style (CPS) (van Wijngaarden, 1966), and\nthere is a simple and well-studied transformation that converts any program into CPS (Fischer, 1972).\nThis approach achieves the same \ufb02exibility as other de\ufb01ne-by-run reverse-mode automatic differ-\nentiation (AD) techniques (Wengert, 1964; Speelpenning, 1980) and naturally extends to loops,\nsubroutines, and recursive functions. Unlike other approaches, however, it can be implemented\nwithout any auxiliary data structures (often called trace or tape). We implicitly use the call stack as\nour data structure, with the bene\ufb01t that the memory is automatically managed and out-of-scope data\nare freed when no longer needed. 
Using delimited continuations and shift/reset control operators\n(Danvy and Filinski, 1990), we can make the callbacks implicit, too, and provide an implementation\nof reverse-mode AD solely through operator overloading.\nOur approach can further be combined with existing graph construction and native code generation\ntechniques to provide an expressive define-then-run computation model, including in-graph functions\nand recursion. In particular, we employ an orthogonal concept called multi-stage programming\n(staging, Taha and Sheard (2000)). Inspired by the natural observation that most programs operate\nin separate stages due to data availability and frequency of operation (Jørring and Scherlis, 1986),\nprogramming language researchers developed tools where a program can be partially evaluated,\nwith code generated for the unevaluated part. The generated code can be in a different (potentially\nlow-level) language, thus removing abstractions (objects, higher-order functions) and improving\nefficiency (Taha and Sheard, 2000). 
Specifically, by utilizing Lightweight Modular Staging (LMS)\n(Rompf and Odersky, 2010), we create a highly efficient and expressive framework dubbed Lantern\nwhich supports both unrestricted control flow as found in PyTorch, as well as the computation graph\nreification in, e.g., TensorFlow.\nWe explain the requisite programming languages concepts and present evaluation results as follows:\n\n• Section 2 shows how delimited continuations naturally support reverse-mode AD.\n• Section 3 explains how multi-stage programming orthogonally brings efficiency.\n• Section 4 evaluates Lantern and demonstrates efficiency and expressivity of our framework.\n\nFinally, Section 5 discusses related work and offers concluding thoughts.\n\n2 Differentiable Programming and Reverse-Mode AD\n\n2.1 Reverse-Mode AD, Explained\nLet v1, v2, ..., vk be the nodes in a computation graph G in a topological ordering (i.e., every node\ncorresponds to some function fi that depends on results of earlier, parental nodes as parameters). For\nneural networks, vk reflects the loss L, which is the target of optimization. During reverse-mode AD,\nthe forward pass first traverses the nodes from v1 to vk, computing the result (value) of each node.\nThe backward pass then traverses the nodes from vk to v1, computing the gradient dL/dvi for each\nnode, which defines the effect of tiny changes of vi on the value of L. While dL/dvk is 1.0, dL/dvi\nfor i < k are calculated by the chain rule:\n\ndL/dv_i = \sum_{j \in Out(i)} (\partial f_j / \partial v_i)^T \, dL/dv_j\n\nHere, Out(i) defines the output nodes of node vi in graph G, and ∂f_j/∂v_i is the Jacobian matrix of\nthe partial derivative of fj to vi.\n2.2 Reverse-Mode AD as Functions with Callbacks\nNormally, reverse-mode AD is implemented with the help of auxiliary data structures. 
For instance,\nthe small example\n\nv1 = 0.5\n\nv2 = 0.4\n\nv3 = v1 + v2\n\nv4 = v2 \u2217 v3\n\nv5 = tanh(v4)\n\ncan be represented as the computation graph in Figure 1 (top).\n\n2\n\n\fThe gray arrows (above each node) form the forward pass, and the red arrows (below each node) form\nthe backward pass. Each node is represented as a rounded square, with the upper half containing\nthe formula for computation (N/A for nodes with initial values), and the lower half containing the\nvalue (left) and the gradient (right). The formulas for the backward pass are labeled on the red\narrows, and gradients from multiple arrows are summed together. Operations of formulas can be\ncomputed by a graph iterator which performs a forward pass \ufb01rst, then a backward pass. This is\nsimilar to the implementation of TensorFlow and PyTorch, though TensorFlow creates new nodes for\nbackpropagation by explicit graph transformation/optimization.\nInspired by the \u201cThere and Back Again\u201d (Danvy and Goldberg, 2005) pattern of reverse-mode AD, a\nkey observation is that we can pro\ufb01tably perform these operations as a sequence of function calls,\none for each elementary operation (Figure 1, bottom). In the lower section of the \ufb01gure, the executor\nof every operation is explicitly labeled. The \ufb01rst v1 + v2 operation is performed by the caller of the\nwhole function (possibly grad, denoted g). g calls the \ufb01rst callback k1, which handles the v2 \u2217 v3\noperation, and calls the second callback k2. k2 then computes tanh(v4) and calls the last callback\nk3. k3 only needs to set the gradient of v5 as 1.0. After k3 returns, k2 updates the gradient of v4\nby the chain rule, and returns to k1, which updates the gradients of v2 and v3. Upon k1\u2019s return, g\nupdates the gradients of v1 and v2. The scopes of each function/callback are also highlighted by\ndashed boxes, showing nested scopes of the chain of callbacks. 
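This walkthrough can be transcribed almost literally into code. Below is a minimal Python sketch (the names Num, add, mul, and tanh are ours, not part of Lantern): the forward computation runs on function entry, the callback executes the downstream computations, and the gradient updates run after the callback returns, on function exit.

```python
import math

class Num:
    """A value paired with a mutable gradient slot."""
    def __init__(self, x):
        self.x, self.d = x, 0.0

def add(a, b, k):
    y = Num(a.x + b.x)
    k(y)                           # forward: continue downstream
    a.d += y.d                     # backward: runs after k returns
    b.d += y.d

def mul(a, b, k):
    y = Num(a.x * b.x)
    k(y)
    a.d += b.x * y.d
    b.d += a.x * y.d

def tanh(a, k):
    y = Num(math.tanh(a.x))
    k(y)
    a.d += (1.0 - y.x * y.x) * y.d

# v5 = tanh(v2 * (v1 + v2)); gradients land in v1.d and v2.d
v1, v2 = Num(0.5), Num(0.4)
add(v1, v2, lambda v3:                        # k1
    mul(v2, v3, lambda v4:                    # k2
        tanh(v4, lambda v5:                   # k3: seed dL/dv5 = 1.0
            setattr(v5, "d", 1.0))))
```

Running this leaves dL/dv1 in v1.d and dL/dv2 in v2.d, matching the chain rule of Section 2.1: v2 receives contributions from both v3 and v4, and all intermediate values live on the call stack.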
Note that although nodes are retained\nin the \ufb01gure for callbacks, it is easy to see that the values and gradients can be saved on the function\ncall stack: no auxiliary heap-allocated data structures are needed.\n\nFigure 1: Reverse-Mode AD represented as graph nodes (top) and reverse-Mode AD via callbacks\n(bottom)\n\nWith this dual view of chains of nodes and nested function calls (Figure 1), we can see that the call\npath implements the forward propagation, and the return path implements the backward propagation.\nInspired by this idea, we show the Scala implementation of this callback version of reverse-mode AD\nin Figure 2.\n\n2.3\n\nImplementation Using Operator Overloading\n\nOur \ufb01rst implementation in Scala is mechanical, directly following the drawing in Figure 1. As\nshown in the left column of Figure 2, we de\ufb01ne a class NumR with two \ufb01elds: an immutable value x,\nand a mutable gradient d. Each operator in NumR takes a callback k which consumes the intermediate\nNumR (y) as a parameter and handles the following forward pass and the leading backward pass. Once\nthe callback k returns, the gradient of y (the correct value of y.d) should have been computed. Then\nthe operator updates the gradients of the dependent values as side effects, using the value of y.d. On\nthe right column is the de\ufb01nition of the grad operator, an example, and the expected unit test. Note\nthat in the de\ufb01nition of grad we provide the \ufb01nal callback of (r => r.d = 1.0), which is to set up the\ngradient of the \ufb01nal NumR as 1.0. 
To aid in presentation, the occurrences of callbacks appear shaded.\n\n// differentiable number type\nclass NumR(val x: Double, var d: Double) {\n  def +(that: NumR) = { (k: NumR=>Unit) =>\n    val y = new NumR(x + that.x, 0.0); k(y)\n    this.d += y.d; that.d += y.d\n  }\n  def *(that: NumR) = { (k: NumR=>Unit) =>\n    val y = new NumR(x * that.x, 0.0); k(y)\n    this.d += that.x * y.d; that.d += this.x * y.d\n  }\n  ...\n}\n\n// differentiation operator\ndef grad(f: NumR => (NumR=>Unit)=>Unit)(x: Double) = {\n  val z = new NumR(x, 0.0)\n  f(z)(r => r.d = 1.0)\n  z.d\n}\n// example: 2*x + x*x*x\nval df = grad { x =>\n  (2*x)(y1 => (x*x)(y2 => (y2*x)(y3 => y1 + y3)))\n}\n// unit test\nforAll { x => df(x) == 2 + 3*x*x }\n\nFigure 2: Automatic Differentiation in Scala: reverse-mode AD by callbacks and operator overloading\n(left), and the grad function definition and use case (right). Handling of continuations is highlighted.\nCode first appeared in Wang and Rompf (2018)\nUnfortunately, the example (last shaded box in Figure 2) is coded in a rather cumbersome way, simply\nbecause we must explicitly construct the callbacks for each step (implicit conversion of Int to NumR is\nelided). A natural question, then, is: Could this be simplified or automated?\n\n2.4 Implementing Reverse-Mode AD with Continuations\n\nThis idea of introducing callbacks for every function result is actually a well-known program transformation, named continuation-passing style (CPS), which has been studied in the PL community for\nmore than 50 years (van Wijngaarden, 1966).\nThe concept of continuations is ubiquitous in programming: an if-branch is the choice between two\ncontinuations, an exception or goto is an abortion/change of continuation, etc. However, in a normal,\n“direct style” of programming, continuations are maintained implicitly by function calls/returns and\nother control flow. 
By contrast, CPS manages control flow by passing continuations explicitly (every\nfunction has an extra parameter called continuation k). For instance, while a direct-style function\nreturns its result directly to the calling function, a CPS function takes as an argument “the rest of\nthe computation” as a function (i.e., continuation), and calls the continuation with the result as a\nparameter. CPS is often used in compilers as an intermediate representation of programs (Appel,\n1992). The transformation of direct-style programs into CPS is also a well-known procedure (Fischer,\n1993), shown below (Figure 3, upper).\n\nTransformation to continuation-passing style:\n\n[[if (e1) e2 else e3]] k = [[e1]](v1 ⇒ if (v1) [[e2]] k else [[e3]] k)\n[[while (e1) e2; e3]] k = def loop() = {[[e1]] (v ⇒ if (v) [[e2]] loop else [[e3]] k)}; loop()\n[[def f(n1, ...) = e1; e]] k = def f(n1, ..., k′) = {[[e1]] k′}; [[e]] k\n[[e(e1, ...)]] k = [[e]] (v ⇒ ([[e1]] (v1 ⇒ (... ⇒ v(v1, ..., k)...))))\n\nTransformation of delimited control operators shift/reset:\n\n[[shift(k ⇒ e)]] k′ = def k(r, k″) = k″(k′(r)); [[e]](x ⇒ x)\n[[reset(e)]] k′ = k′([[e]](x ⇒ x))\n\nFigure 3: Program transformation to CPS (upper) and for the delimited control operators shift/reset\n(lower). [[e]] k denotes a program e in direct style, transformed with given continuation k.\n\nThe rules in Figure 3 transform direct-style programs to CPS, where the continuations are always\nmaintained as tail calls, which never return to the callers. However, this is insufficient for the callbacks\nneeded in reverse-mode AD, as these callbacks must return. This can be achieved through the use of\ndelimited continuations (Felleisen, 1988), which, as the name suggests, are continuations up to certain\nboundaries, defined by the control delimiters. 
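As a concrete instance of these (tail-call) CPS rules, consider a small recursive function in direct style and its hand-converted CPS counterpart (a Python sketch; the function names are ours):

```python
# Direct style: the result is returned to the caller.
def fact(n):
    return 1 if n == 0 else n * fact(n - 1)

# CPS: every function takes an extra continuation parameter k and
# "returns" by tail-calling k with its result, as in the rules above.
def fact_cps(n, k):
    if n == 0:
        return k(1)
    return fact_cps(n - 1, lambda r: k(n * r))
```

Calling fact_cps(5, lambda x: x) with the identity as the initial continuation yields the same result as fact(5); note that every continuation call is a tail call that never returns to its caller.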
When arriving at the boundaries, the continuations\nreturn to their caller, possibly with return values. In that sense, delimited continuations are more like\nnormal functions, and they do not have to be tail calls. The remaining key difference is that delimited\ncontinuations are constructed from part of the program.\n\n4\n\n\fDelimited continuations can be generated from direct-style programs supplemented with control\noperators. Several forms of control operators exist, but we will use the pair with the closest relation\nto CPS, named shift/reset (Danvy and Filinski, 1990). The formal rules are shown in Figure 3\n(bottom), which lay out the form of transformations for the shift and reset operators. Modern\ntools (Rompf et al., 2009) further simplify the transformations for delimited continuations to selective\nCPS transformation, where program fragments without shift/reset are kept in direct style (no need\nfor the k(cid:48)(cid:48) parameter in the k function in shift transformation rule).\nGenerally speaking, the reset operator de\ufb01nes the boundary of the delimited continuations, while the\nshift operator captures the delimited continuations. Their roles can be further explained using the\nfollowing toy example.\n\nval a = 1 + reset { 10 + shift { k => k(k(100)) + 1000 } }\nThe delimited continuation is the program between shift and reset (the shaded box above), which\ncan be embodied by replacing the shift construct (the white box above) as function parameter, and\nrewriting the reset block as the function, i.e., continuation (the shaded box below), and then passed\nto the shift block.\n\nval a = 1 + { (k => k(k(100)) + 1000) (x => 10 + x) }\n\nThen the delimited continuation is captured by the shift construct, as the continuation parameter k in\nthe shift block. 
The final result is then the evaluation of the body of shift.\n\nval a = 1 + { (10 + (10 + 100)) + 1000 } = 1121\n\nIn this way, the program is still written in direct style (with shift and reset operators). However, the\nautomated transformation will reorganize it into CPS format, realizing delimited continuations. Thus,\nthe cumbersome example of Figure 2 can be simplified by using shift and reset in construction. We\nprovide this implementation below (Figure 4).\n\n// differentiable number type\nclass NumR(val x: Double, var d: Double) {\n  def +(that: NumR) = shift { (k: NumR=>Unit) =>\n    val y = new NumR(x + that.x, 0.0); k(y)\n    this.d += y.d; that.d += y.d\n  }\n  def *(that: NumR) = shift { (k: NumR=>Unit) =>\n    val y = new NumR(x * that.x, 0.0); k(y)\n    this.d += that.x * y.d; that.d += this.x * y.d\n  }\n  ...\n}\n\n// differentiation operator\ndef grad(f: NumR => NumR @cps[Unit])(x: Double) = {\n  val z = new NumR(x, 0.0)\n  reset { f(z).d = 1.0 }\n  z.d\n}\n// example\nval df = grad(x => 2*x + x*x*x)\n// unit test\nforAll { x => df(x) == 2 + 3*x*x }\n\nFigure 4: Automatic Differentiation in Scala: reverse-mode using delimited continuations with\nshift/reset operators (left), and grad function definition and use case (right). Code first appeared\nin Wang and Rompf (2018)\n\nIn this figure, the occurrences of shift/reset and delimited continuations are again shaded. The\nshift/reset program transformation is handled by the Scala compiler accordingly (Rompf et al.,\n2009). The implementation of NumR with shift/reset operators is almost identical to NumR in Figure 2\n(modulo added shift). Note that a shift operator returns a CPS-annotated type A@cps[B, C],\nmeaning that the continuation k in shift is of type (A ⇒ B), and the body of shift is of type C. When\ntype B equals C, we denote it as A@cps[B]. 
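The toy shift/reset example and its manual rewriting can be checked directly in any language with first-class functions; here is a Python transcription of the rewritten (CPS) form:

```python
# The reset body becomes the continuation (x => 10 + x); the shift body
# receives it as k and may call it zero, one, or (here) two times.
k = lambda x: 10 + x
a = 1 + (k(k(100)) + 1000)
```

Evaluating step by step: k(100) = 110, k(110) = 120, plus 1000 and the outer 1 gives 1121, matching the worked example.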
Importantly, handling of continuations is confined to\nimplementation logic and does not leak into user code (see the example in Figure 4).\nOur approach has some similarity with the seminal paper by Pearlmutter and Siskind (2008) who\nalso formalized reverse-mode AD in a functional style. However, several important aspects are\nsubstantially different. For one, their implementation uses nonlocal code transformations to return a\npair consisting of a value and a backpropagator: x ↦ (v, dv/dy ↦ dx/dy) for backpropagation. We\napply delimited continuations using shift/reset operators, which hide the nonlocal transformations\nfrom the developer, so that reverse-mode AD can be implemented purely via operator overloading.\nTheir approach is purely functional (no variables are mutated during computations), which needs\nspecial care (a channel) if a lambda uses variables from an outer scope. On the other hand, we allow\nlimited mutation of gradients (gradient accumulation), which offers elegant implementation at the\nslight (and worthwhile, in our opinion) trade-off of functional purity. Moreover, all closures and\nmutable variables in our approach can be allocated on the stack, which serves as an implicit data\nstructure for intermediate values. Other current approaches require at least some use of heap memory.\nHigher-order gradients can also be computed with our approach. One technical caveat is that\nsecond-order shift/reset operators are not available in Scala, thus we cannot naively nest our\ngradient computations, though it can be achieved in a different language which supports higher-order\nshift/reset. However, even in Scala, we can get second-order gradients (Hessians) by combining\nreverse-mode AD with forward-mode AD. We elide forward-mode AD in this paper, as it can be\neasily implemented by operator overloading in many languages. 
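Forward-mode AD by operator overloading does indeed fit in a few lines; the following Python sketch uses dual numbers (the names Dual and deriv are ours; the Hessian construction described above would run such a forward pass over the reverse-mode NumR values):

```python
class Dual:
    """Forward-mode AD value: carries a value x and a tangent d."""
    def __init__(self, x, d=0.0):
        self.x, self.d = x, d
    def __add__(self, that):
        return Dual(self.x + that.x, self.d + that.d)
    def __mul__(self, that):
        # product rule on the tangents
        return Dual(self.x * that.x, self.d * that.x + self.x * that.d)

def deriv(f, x):
    return f(Dual(x, 1.0)).d   # seed the input tangent with 1.0

# d/dx (2x + x^3) = 2 + 3x^2
f = lambda x: Dual(2.0) * x + x * x * x
```

For example, deriv(f, 3.0) computes 2 + 3 * 9 = 29 in a single forward pass, with no tape or callbacks.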
By applying forward-mode AD on\ntop of reverse-mode AD (changing the Double in the code snippets to a pair of Doubles, representing\nthe value and tangent, respectively), we can efficiently compute Hessians (or the Hessian vector dot\nproduct, for any given vector).\n\n3 Code Generation via Multi-Stage Programming\n\nVia delimited continuations, we get an expressive and define-by-run framework, similar to PyTorch.\nHowever, TensorFlow and other define-then-run frameworks benefit from separating graph construction and graph execution into two stages, so that graph transformations/optimizations can be\nperformed to target hardware-specific code (i.e., GPUs or TPUs). As such, we examine the possibility\nof utilizing this concept.\nA key insight in understanding how to adopt this paradigm is that TensorFlow graph construction\nis similar to a 30-year-old PL concept called multi-stage programming (staging, Taha and Sheard\n(2000)). A TensorFlow program can blend normal Python code with graph construction, just like the\nwell-established staging tool called Lightweight Modular Staging (LMS) (Rompf and Odersky, 2010)\ncan blend normal Scala program execution with IR construction (this IR (intermediate representation)\nis not executed, but rather used to generate code for the next stage).\n\n# graph construction\nimport tensorflow as tf\na = tf.constant(0)\nb = lambda i: tf.less(i, 10)\nc = lambda i: tf.add(i, 1)\nr = tf.while_loop(b, c, [a])\n\n// graph construction\nimport lms._\nval a: Rep[Float] = 0.0\nwhile (a < 10)\n  a += 1\nval r = a\n\n// generated code\nfloat x0 = 0.0;\nwhile (x0 < 10) {\n  x0 += 1\n}\nfloat x1 = x0;\n\nFigure 5: TensorFlow graph construction (left), LMS IR construction (middle), and code generated\nfrom LMS (right).\nWe show a simple TensorFlow graph construction example and corresponding LMS code generation in\nFigure 5. Instead of tf.constant, LMS uses higher-order types (Rep[T]) to label IR constructions. 
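The essence of staging with Rep-like types can be sketched in a few lines of Python. This is a toy illustration in the spirit of LMS, not the actual LMS API: operations on Rep values do not compute anything; they emit code for the next stage.

```python
class Rep:
    """Toy staged value: building one emits a line of generated C code."""
    fresh = 0
    code = []

    def __init__(self, expr):
        Rep.fresh += 1
        self.name = f"x{Rep.fresh}"
        Rep.code.append(f"float {self.name} = {expr};")

    def __add__(self, that):
        # '+' runs now (at staging time) but only produces code
        return Rep(f"{self.name} + {that.name}")

a = Rep("0.0")
b = Rep("1.0")
c = a + b   # staged away: emits an addition instead of performing one
```

After running this, Rep.code holds the three generated C statements; the Python-level abstraction (objects, operator overloading) is entirely absent from the generated code, which is the point of staging.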
All\nRep-typed values (and computations depending on Rep-typed values) are treated as IR and translated\ninto generated code, while all other typed values are treated as normal Scala expressions and are\n\u201cstaged away\u201d from the generated code. Relying on type inference and advanced operator overloading,\nLMS also extends to built-in control \ufb02ow constructs like if, for, and while, so that normal syntax\nwith subroutines and recursion can be used, in striking contrast to the clunky TensorFlow API. In fact,\nthe Rep types in LMS code are the only giveaways that any IR construction is taking place. We elide\nthe mechanisms of code generation in LMS, as they are not a contribution of this paper but covered\nin a substantial body of relevant publications (Rompf and Odersky, 2010; Rompf, 2012; Rompf et al.,\n2012; Rompf and Odersky, 2012; Kossakowski et al., 2012; Ackermann et al., 2012; Ofenbeck et al.,\n2013; Rompf et al., 2013, 2015; Rompf and Amin, 2015; Rompf, 2016a,b; Ofenbeck et al., 2017;\nAmin and Rompf, 2018; Stojanov et al., 2018; Tahboub et al., 2018; Essertel et al., 2018).\nAlthough we showcase LMS as the tool of staging, and shift/reset in Scala, it should be noted that\nthese two concepts are supported in other languages as well: our design is not con\ufb01ned to Scala.\nFor instance, shift/reset are common fare in certain dynamic languages in the Lisp/Scheme/Racket\ntradition, often implemented via stack-copying at runtime (Clinger et al., 1999). It would be very much\nfeasible to implement shift/reset in Python; the \u201cStackless Python\u201d1 dialect already provides similar\nfacilities. Efforts like AutoGraph (Moldovan et al., 2018) provide LMS-like staging mechanisms for\na subset of Python.\n\n1https://github.com/stackless-dev/stackless/wiki\n\n6\n\n\fThere are also several choices in how to combine delimited continuations and staging. A program can\nbe CPS transformed \ufb01rst, then staged to low-level languages. 
Otherwise, we can choose to stage the\nprogram to a medium-level language \ufb01rst (e.g., Scheme), do CPS transformation, and then compile it\nto low-level code (C/CUDA). Various degrees of engineering may be needed depending on the choice\nof languages and options, but no fundamental challenges should exist.\nWe choose to implement CPS-then-staging in Scala, merely out of convenience. With the requisite\nimplementations in place, we have established an expressive framework capable of supporting\nbranches, loops, and recursion, similar to the de\ufb01ne-by-run style of PyTorch. However, our approach\nis actually de\ufb01ne-then-run, which maintains a larger surface for analysis and optimization, like\nTensorFlow (but with in-graph functions and recursion). Aside from high-level optimizations among\ntensor operations that can be added in staging, our approach may bene\ufb01t from general compiler\noptimizations as well, since the program after CPS transformation is no different from normal\nprograms that are free of AD logic.\n\n4 Evaluation and Case Studies\n\nIn this section, we validate our design by implementing and evaluating our prototypic framework,\ndubbed Lantern2. Lantern builds on the code in earlier sections, but supports handling tensor objects\n(multi-dimension arrays with common linear algebra operations such as element-wise operations\nwith broadcasting, matrix multiplication, and convolution). The basic classes are shown below, with\nTensor relating to Double, and TensorR relating to NumR in earlier code snippets. Note that for each\nTensor, the data is Rep typed (as IR), but the shape is not (as it is known at staging time). 
Each TensorR\nobject contains a value x and a gradient d, and operations on TensorR are implemented with shift\noperators providing access to delimited continuations.\n\nclass Tensor(val data: Rep[Array[Double]], val shape: Array[Int]) {...}\nclass TensorR(val x: Tensor, val d: Tensor) {...}\n\nWhile some operations are linked to the OpenBLAS implementation, most operations are implemented\nas simple C++ loops. Even with such a naive backend implementation, Lantern demonstrates potential\nfor being both expressive and ef\ufb01cient, at least for some small/medium-sized models running on a\nsingle CPU, as shown by comparing with PyTorch, TensorFlow, and DyNet (Neubig et al., 2017). To\nbe complete, we plan to integrate with standard tensor compiler pipelines (e.g., XLA (TensorFlow\nteam, 2018; Distributed (Deep) Machine Learning Community, 2018)) or with purpose-built compiler\nframeworks that directly extend LMS (e.g., Delite and OptiML (Sujeeth et al., 2014, 2011)) as future\nwork.\n\n4.1 Evaluation of Four Common Deep Learning Architectures\n\nWe selected four representative machine learning architectures for our evaluations: a vanilla Recurrent\nNeural Network (RNN), Long Short-Term Memory (LSTM), TreeLSTM, and a Convolutional Neural\nNetwork (CNN). Sample implementations of these in TensorFlow or PyTorch are readily available\nonline, with either arti\ufb01cial or practical benchmarks. As a new deep learning framework that provides\nreverse-mode AD with a tensor API, our evaluation focuses on expressivity and ef\ufb01ciency, rather\nthan model generalization.3\nAs shown in Figure 6, we compared Lantern with TensorFlow and PyTorch (DyNet implementation\nwas only introduced for TreeLSTM for the bene\ufb01t of autobatching). The training loss (not shown) in\nall architectures had similar decay, indicating that Lantern correctly implements backward propagation.\nWe elected to only gauge the runtime of training loops, as that is the majority of computation. 
For\nvanilla RNN and LSTM, we evaluated at batch size 20. The training time for Lantern in both cases is\nless compared with that of PyTorch, and comparable to that of TensorFlow. For CNN, the evaluation\nwas done at batch size 100, and Lantern performed similarly with PyTorch and TensorFlow (compiled\nfrom source with Intel\u00ae Math Kernel Library for Deep Neural Networks (Intel\u00ae MKL-DNN)\nsupport).\n\n2https://github.com/feiwang3311/Lantern\n3All experiments were run using a single CPU on a cluster with Intel Xeon Platinum 8168 CPUs at 2.70GHz\n\nand 0.75 TB RAM per node.\n\n7\n\n\fFigure 6: Comparison of training times for vanilla RNN (top left), LSTM (top right), TreeLSTM\n(bottom left), and CNN (bottom right).\nWe would like to give extra attention to the evaluation of TreeLSTM, which is adapted from Sentiment\nClassi\ufb01cation using the dataset from the Stanford Sentiment Treebank (Chuang, 2013) following\nthe work of Tai et al. (2015). Brie\ufb02y, the model evaluates tree-structured parsed sentences (movie\nreviews) for sentiment (range 1 to 5).\n\nhi = TreeLSTM(Embedding(word), hi.left, hi.right)\n\nHere, hi is the hidden vector and the cell state (default when describing LSTM) associated with node\ni, and the Embedding is a large lookup table which maps each word to a 300-sized array, re\ufb02ecting the\nsemantic distances between all words in the vocabulary. TreeLSTM differs from a simple LSTM by\ntaking two previous states, from both the left and right children. For leaf nodes, the previous states\nare zero, as is the embedding for non-leaf nodes. The hidden vector from each node can be used to\ncompute a softmax of labels, thus generating a cross-entropy loss by comparing with the true label\nfor each node. 
By training, the total loss (or average loss per sentence) should be reduced; thus the\nTreeLSTM learns to evaluate reviews in a parse-tree format.\n\n// definition of loss function\ndef lossFun(root: Rep[Tree]) = {\n  val init = (init_loss, init_hidden, init_cell)\n  def f = FUN { node: Rep[Tree] =>\n    if (node.isEmpty) init else {\n      val (left, right) = (f(node.left), f(node.right))\n      LSTM_core(left, right)  // return (new_loss, new_hidden, new_cell)\n    }\n  }\n  val (outLoss, _, _) = f(root)\n  outLoss  // only return the loss\n}\n// gradient update loop\nfor (n <- (0 until maxIter): Rep[Range]) {\n  grad(lossFun(next_training_data()))  // gradients are updated as side effects\n  gradient_descent()\n}\n\nFigure 7: TreeLSTM implementation in Lantern. FUN emits a differentiable recursive function.\n\nThis model is worth examination due to the fact that TreeLSTM is a recursive model (the computation\ngraph is recursively and dynamically defined by the structure of the training data). In PyTorch4 and\nLantern, this model can be easily expressed as recursive functions (see Lantern implementation in\nFigure 7), but Lantern\u2019s implementation is more efficient (Figure 6). However, such dynamic models\nare very hard to batch: one cannot simply add another dimension for batching, since each training\ndatum may require a different computation graph. As such, both Lantern and PyTorch were run at\nbatch size 1. Both TensorFlow and DyNet have partial solutions to this challenge.\nTensorFlow cannot handle a recursive model easily, so the implementation5 used TensorFlow\nFold (Looks et al., 2017), a TensorFlow extension that statically modifies the computation graph\nbased on the training data. Such a tool is more clunky and ad-hoc to use, but the benefit is that it\nallows for effective batching, since a unified static computation graph is constructed based on the\ntraining data.\n\n4https://github.com/ttpro1995/TreeLSTMSentiment
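The shape of this recursion, independent of Lantern's FUN and staging machinery, can be sketched in plain Python. Everything below is a toy stand-in: the tuple-based tree, the cell function, and the squared-value per-node loss merely mirror the roles of LSTM_core and the cross-entropy loss in Figure 7.

```python
def tree_loss(node, cell):
    """Recursive, data-dependent computation: the 'graph' is the tree."""
    if node is None:
        return 0.0, 0.0                  # (loss, hidden) for an empty child
    value, left, right = node
    lloss, lh = tree_loss(left, cell)
    rloss, rh = tree_loss(right, cell)
    h = cell(value, lh, rh)              # combine states of both children
    return lloss + rloss + h * h, h      # accumulate a toy per-node loss

# toy cell: node value plus the mean of the children's hidden states
cell = lambda x, lh, rh: x + 0.5 * (lh + rh)
tree = (1.0, (2.0, None, None), None)    # root with a single left leaf
```

Each training datum induces a different recursion pattern, which is exactly why such models resist naive batching: there is no single static graph to add a batch dimension to.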
We evaluated TensorFlow Fold at batch size 20; it indeed runs faster than PyTorch, but not as fast as Lantern.

It is especially interesting to include another framework, DyNet, in this evaluation. DyNet is very similar to PyTorch, being a define-by-run framework, but offers autobatching (Neubig et al., 2017), which dynamically batches similar computation nodes at runtime. We observed that DyNet without autobatching (labeled DyNetNB) is somewhat slower than Lantern, while DyNet with autobatching (labeled DyNetB) gains approximately a 40% speedup and runs approximately 20% faster than Lantern. However, we still used batch size 1, so that only autobatching within each training datum is enabled. Our tests show that larger batch sizes actually hurt performance, indicating that DyNet's autobatching heuristics can still be improved, and that it is worthwhile to explore autobatching options in Lantern as future work.

5 Related Work and Concluding Remarks

Several works from the PL community address the problem of differentiation. Karczmarczuk (2001) presented a functional implementation of differentiation using lazy evaluation that can compute infinite towers of derivatives of higher order. Elliott (2009) developed an implementation of higher-dimensional, higher-order forward-mode automatic differentiation (AD) by calculus on manifolds. In practice, however, forward-mode AD has much higher complexity than reverse mode for machine learning models, which map many parameters to a scalar loss. The seminal work by Siskind and Pearlmutter (2008) formalized forward- and reverse-mode AD in a functional framework. Several practical projects were developed based on their model, including a flexible differentiable functional programming library called DiffSharp (Baydin et al., 2016), and a differentiable library for natural language processing in Python called Thinc/spaCy6. Elliott (2018) provided a generalization of AD based on category theory.
However, the model as presented does not cover in-graph control flow, which limits its range of application.

The ML community has also worked to bridge the gap between define-by-run frameworks that are easy to use and define-then-run frameworks that are efficient to run. Examples include Tangent (van Merrienboer et al., 2018), which provides AD in Python through the use of source-to-source transformations, and Myia (Breuleux and van Merriënboer, 2017), which implements a first-order gradient operator for a subset of Python (using a dedicated functional representation). Another line of work, AutoGraph (Moldovan et al., 2018), directly stages Python functions into an intermediate representation and subsequently dispatches to different define-then-run frameworks as back-ends, including TensorFlow and Lantern.

The history of continuations is recounted nicely by Reynolds (1993).

Compared with related work, our contribution stands out by applying two well-understood PL concepts (delimited continuations and multi-stage programming) to reverse-mode AD, arriving at a concise, expressive, and efficient backpropagation framework.
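As a language-neutral illustration of this idea, the following minimal sketch (plain Python, not Lantern's actual implementation) expresses reverse-mode AD with continuation callbacks: each operator performs its forward computation, invokes its continuation, and accumulates gradients as the call returns.

```python
class Num:
    """A scalar paired with a gradient slot, updated as a side effect."""
    def __init__(self, x):
        self.x, self.d = x, 0.0

def add(a, b, k):
    y = Num(a.x + b.x)
    k(y)              # the forward pass continues inside the callback ...
    a.d += y.d        # ... and the backward pass runs when the call returns
    b.d += y.d

def mul(a, b, k):
    y = Num(a.x * b.x)
    k(y)
    a.d += b.x * y.d
    b.d += a.x * y.d

def grad(f, x):
    """Differentiate f (written in CPS) at x; the final continuation seeds the gradient."""
    a = Num(x)
    f(a, lambda y: setattr(y, "d", 1.0))
    return a.d

# d/dx (x*x + x) = 2x + 1, so at x = 3.0 the gradient g is 7.0
g = grad(lambda x, k: mul(x, x, lambda y: add(y, x, k)), 3.0)
```

The function call stack is the only auxiliary data structure here: the forward pass is the sequence of calls, and the backward pass is the corresponding sequence of returns.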
The underlying ideas are agnostic to the choice of programming language, and thus have the potential to benefit the ML community regardless of implementation language.

Acknowledgments

This work was supported in part by NSF awards 1553471 and 1564207, DOE award DE-SC0018050, and a Google Faculty Research Award.

5https://github.com/tensorflow/fold/blob/master/tensorflow_fold/g3doc/sentiment.ipynb
6https://github.com/explosion/thinc

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). https://www.tensorflow.org/ Software available from tensorflow.org.

Stefan Ackermann, Vojin Jovanovic, Tiark Rompf, and Martin Odersky. 2012. Jet: An Embedded DSL for High Performance Big Data Processing (BigData). http://infoscience.epfl.ch/record/181673/files/paper.pdf.

Nada Amin and Tiark Rompf. 2018. Collapsing towers of interpreters. PACMPL 2, POPL (2018), 52:1–52:33.

Andrew W. Appel. 1992. Compiling with Continuations. Cambridge University Press.

Atilim Günes Baydin, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. DiffSharp: An AD Library for .NET Languages. CoRR abs/1611.03423 (2016).

Olivier Breuleux and Bart van Merriënboer. 2017. Automatic Differentiation in Myia. In NIPS AutoDiff Workshop.

Jason Chuang. 2013.
Stanford Sentiment Treebank. (2013). https://nlp.stanford.edu/sentiment/treebank.html

William D. Clinger, Anne Hartheimer, and Eric Ost. 1999. Implementation Strategies for First-Class Continuations. Higher-Order and Symbolic Computation 12, 1 (1999), 7–45.

Olivier Danvy and Andrzej Filinski. 1990. Abstracting Control. In LISP and Functional Programming. 151–160.

Olivier Danvy and Mayer Goldberg. 2005. There and back again. Fundamenta Informaticae 66, 4 (2005), 397–413.

Distributed (Deep) Machine Learning Community. 2018. NNVM: Open Compiler for AI Frameworks. (2018). https://github.com/dmlc/nnvm

Conal Elliott. 2018. The Simple Essence of Automatic Differentiation. PACMPL 2, ICFP (2018).

Conal M. Elliott. 2009. Beautiful differentiation. In ICFP. ACM, 191–202.

Grégory M. Essertel, Ruby Y. Tahboub, James M. Decker, Kevin J. Brown, Kunle Olukotun, and Tiark Rompf. 2018. Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data. In OSDI. USENIX Association, 799–815.

Matthias Felleisen. 1988. The Theory and Practice of First-Class Prompts. In POPL. ACM Press, 180–190.

Michael J. Fischer. 1972. Lambda Calculus Schemata. In Proceedings of ACM Conference on Proving Assertions About Programs. ACM, New York, NY, USA.

Michael J. Fischer. 1993. Lambda-Calculus Schemata. Lisp and Symbolic Computation 6, 3-4 (1993), 259–288.

Ulrik Jørring and William L. Scherlis. 1986. Compilers and Staging Transformations. In POPL. ACM Press, 86–96.

Jerzy Karczmarczuk. 2001. Functional Differentiation of Computer Programs. Higher-Order and Symbolic Computation 14, 1 (2001), 35–57.

Grzegorz Kossakowski, Nada Amin, Tiark Rompf, and Martin Odersky. 2012. JavaScript as an Embedded DSL. In ECOOP. 409–434.

Yann LeCun. 2018. Deep Learning est mort. Vive Differentiable Programming!
https://www.facebook.com/yann.lecun/posts/10155003011462143. (2018).

Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. 2017. Deep Learning with Dynamic Computation Graphs. ICLR (2017).

Dan Moldovan, James M Decker, Fei Wang, Andrew A Johnson, Brian K Lee, Zachary Nado, D Sculley, Tiark Rompf, and Alexander B Wiltschko. 2018. AutoGraph: Imperative-style Coding with Graph-based Performance. ArXiv e-prints (2018). arXiv:cs.MS/1810.08061

Graham Neubig, Yoav Goldberg, and Chris Dyer. 2017. On-the-fly Operation Batching in Dynamic Computation Graphs. In NIPS. 3974–3984.

Georg Ofenbeck, Tiark Rompf, and Markus Püschel. 2017. Staging for generic programming in space and time. In GPCE. ACM, 15–28.

Georg Ofenbeck, Tiark Rompf, Alen Stojanov, Martin Odersky, and Markus Püschel. 2013. Spiral in scala: towards the systematic construction of generators for performance libraries. In GPCE. ACM, 125–134.

Christopher Olah. 2015. Neural Networks, Types, and Functional Programming. http://colah.github.io/posts/2015-09-NN-Types-FP/. (2015).

Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. (2017). www.pytorch.org

Barak A. Pearlmutter and Jeffrey Mark Siskind. 2008. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Trans. Program. Lang. Syst. 30, 2 (2008), 7:1–7:36.

John C. Reynolds. 1993. The Discoveries of Continuations. Lisp and Symbolic Computation 6, 3-4 (1993), 233–248.

Tiark Rompf. 2012. Lightweight Modular Staging and Embedded Compilers: Abstraction Without Regret for High-Level High-Performance Programming. Ph.D. Dissertation. EPFL.

Tiark Rompf. 2016a. The Essence of Multi-stage Evaluation in LMS. In A List of Successes That Can Change the World (Lecture Notes in Computer Science), Vol. 9600.
Springer, 318–335.

Tiark Rompf. 2016b. Reflections on LMS: exploring front-end alternatives. In Scala Symposium. ACM, 41–50.

Tiark Rompf and Nada Amin. 2015. Functional pearl: a SQL to C compiler in 500 lines of code. In ICFP. ACM, 2–9.

Tiark Rompf, Nada Amin, Adriaan Moors, Philipp Haller, and Martin Odersky. 2012. Scala-Virtualized: linguistic reuse for deep embeddings. Higher-Order and Symbolic Computation 25, 1 (2012), 165–207.

Tiark Rompf, Kevin J. Brown, HyoukJoong Lee, Arvind K. Sujeeth, Manohar Jonnalagedda, Nada Amin, Georg Ofenbeck, Alen Stojanov, Yannis Klonatos, Mohammad Dashti, Christoph Koch, Markus Püschel, and Kunle Olukotun. 2015. Go Meta! A Case for Generative Programming and DSLs in Performance Critical Systems. In SNAPL (LIPIcs), Vol. 32. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 238–261.

Tiark Rompf, Ingo Maier, and Martin Odersky. 2009. Implementing first-class polymorphic delimited continuations by a type-directed selective CPS-transform. In ICFP. ACM, 317–328.

Tiark Rompf and Martin Odersky. 2010. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs. In GPCE. ACM, 127–136.

Tiark Rompf and Martin Odersky. 2012. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs. Commun. ACM 55, 6 (2012), 121–130.

Tiark Rompf, Arvind K. Sujeeth, Nada Amin, Kevin Brown, Vojin Jovanovic, HyoukJoong Lee, Manohar Jonnalagedda, Kunle Olukotun, and Martin Odersky. 2013. Optimizing Data Structures in High-Level Programs (POPL).

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533.

Jeffrey Mark Siskind and Barak A. Pearlmutter. 2008. Nesting forward-mode AD in a functional framework.
Higher-Order and Symbolic Computation 21, 4 (2008), 361–376.

Bert Speelpenning. 1980. Compiling fast partial derivatives of functions given by algorithms. Ph.D. Dissertation.

Alen Stojanov, Ivaylo Toskov, Tiark Rompf, and Markus Püschel. 2018. SIMD intrinsics on managed language runtimes. In CGO. ACM, 2–15.

Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2014. Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. ACM Trans. Embedded Comput. Syst. 13, 4s (2014), 134:1–134:25.

Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Tiark Rompf, Hassan Chafi, Michael Wu, Anand R. Atreya, Martin Odersky, and Kunle Olukotun. 2011. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In ICML. Omnipress, 609–616.

Walid Taha and Tim Sheard. 2000. MetaML and multi-stage programming with explicit annotations. Theor. Comput. Sci. 248, 1-2 (2000), 211–242.

Ruby Y. Tahboub, Grégory M. Essertel, and Tiark Rompf. 2018. How to Architect a Query Compiler, Revisited. In SIGMOD Conference. ACM, 307–322.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL (1). The Association for Computer Linguistics, 1556–1566.

TensorFlow team. 2018. XLA Overview. (2018). https://www.tensorflow.org/performance/xla/

Bart van Merrienboer, Dan Moldovan, and Alexander Wiltschko. 2018. Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming. In NIPS.

Adriaan van Wijngaarden. 1966. Recursive definition of syntax and semantics. Formal Language Description Languages for Computer Programming (1966), 13–24.

Fei Wang and Tiark Rompf. 2018.
A Language and Compiler View on Differentiable Programming. ICLR Workshop Track (2018). https://openreview.net/forum?id=SJxJtYkPG

R. E. Wengert. 1964. A simple automatic derivative evaluation program. Commun. ACM 7, 8 (1964), 463–464.