{"title": "Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming", "book": "Advances in Neural Information Processing Systems", "page_first": 6256, "page_last": 6265, "abstract": "The need to efficiently calculate first- and higher-order derivatives of increasingly complex models expressed in Python has stressed or exceeded the capabilities of available tools. In this work, we explore techniques from the field of automatic differentiation (AD) that can give researchers expressive power, performance and strong usability. These include source-code transformation (SCT), flexible gradient surgery, efficient in-place array operations, and higher-order derivatives. We implement and demonstrate these ideas in the Tangent software library for Python, the first AD framework for a dynamic language that uses SCT.", "full_text": "Tangent: Automatic differentiation using source-code\n\ntransformation for dynamically typed array\n\nprogramming\n\nBart van Merri\u00ebnboer\nMILA, Google Brain\nbartvm@google.com\n\nDan Moldovan\nGoogle Brain\n\nmdan@google.com\n\nAbstract\n\nAlexander B Wiltschko\n\nGoogle Brain\n\nalexbw@google.com\n\nThe need to ef\ufb01ciently calculate \ufb01rst- and higher-order derivatives of increasingly\ncomplex models expressed in Python has stressed or exceeded the capabilities of\navailable tools. In this work, we explore techniques from the \ufb01eld of automatic\ndifferentiation (AD) that can give researchers expressive power, performance and\nstrong usability. These include source-code transformation (SCT), \ufb02exible gradient\nsurgery, ef\ufb01cient in-place array operations, and higher-order derivatives. 
We\nimplement and demonstrate these ideas in the Tangent software library for Python,\nthe \ufb01rst AD framework for a dynamic language that uses SCT.\n\n1\n\nIntroduction\n\nMany applications in machine learning rely on gradient-based optimization, or at least the ef\ufb01cient\ncalculation of derivatives of models expressed as computer programs. Researchers have a wide\nvariety of tools from which they can choose, particularly if they are using the Python language\n[21, 16, 24, 2, 1]. These tools can generally be characterized as trading off research or production use\ncases, and can be divided along these lines by whether they implement automatic differentiation using\noperator overloading (OO) or SCT. SCT affords more opportunities for whole-program optimization,\nwhile OO makes it easier to support convenient syntax in Python, like data-dependent control \ufb02ow, or\nadvanced features such as custom partial derivatives. We show here that it is possible to offer the\nprogramming \ufb02exibility usually thought to be exclusive to OO-based tools in an SCT framework.\nTangent is the \ufb01rst AD framework using SCT in a dynamically typed language. We produce ef\ufb01cient\nderivatives using a novel combination of multiple dispatch, lazy evaluation, and static optimizations.\nFurther, Tangent has mutable multidimensional arrays as \ufb01rst class objects, implemented using\npersistent data structures for performance in the context of reverse mode AD. By operating directly\non Python source code, Tangent is able to achieve a separation of concerns that other AD libraries do\nnot. Speci\ufb01cally, we achieve compositionality with tools in the Python ecosystem, such as debuggers,\npro\ufb01lers and other compilers. 
Tangent makes it easy and ef\ufb01cient to express machine learning models,\nand is open source 1.\n\n2 Background\n\nAutomatic differentiation (AD) is a set of techniques to evaluate derivatives of mathematical functions\nde\ufb01ned as programs [10], and is heavily used in machine learning [3]. It is based on the insight that\nthe chain rule can be applied to the elementary arithmetic operations (primitives) performed by the\nprogram. This allows derivatives to be calculated up to machine precision [17] with only a constant\n\n1Source code and documentation available at https://github.com/google/tangent\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\foverhead per operation. AD is different from symbolic differentiation (which applies to mathematical\nexpressions instead of programs) and numerical differentiation (where the gradient is approximated\nusing \ufb01nite differences).\nFor multidimensional functions, f : Rn \u2192 Rm, where f is a composition of primitives with known\nderivatives, the application of the chain rule results in a series of matrix-vector multiplications\ninvolving the primitives\u2019 Jacobians and partial derivatives of intermediate values. The order in which\nthese multiplications are evaluated determines the runtime complexity. Forward-mode AD evaluates\nthe chain rule from inside to outside and is ef\ufb01cient for functions where m > n. The implementation\nof forward mode is relatively straightforward, since the partial derivatives are evaluated in step with\nthe primitives. Forward mode is commonly implemented by replacing numbers with dual numbers,\nwhich can be interpreted as a variable\u2019s value along with its partial derivative with respect to one\nof the inputs. Reverse-mode AD, where the chain rule is evaluated from outside to inside, is more\nef\ufb01cient in the case where n > m. 
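The dual-number implementation of forward mode mentioned above can be sketched in a few lines of plain Python (an illustrative toy, not Tangent's implementation):

```python
# Dual numbers: carry a value and its derivative through each primitive.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val  # primal value
        self.dot = dot  # derivative with respect to the chosen input

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x

y = f(Dual(2.0, 1.0))  # seed dx/dx = 1
print(y.val, y.dot)    # 10.0 7.0
```

Seeding `dot = 1` for one input at a time yields each partial derivative, which is why forward mode costs one pass per input and is efficient when inputs are few.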
Reverse mode is more complex to implement because evaluation\nof the partial derivatives requires reversing the execution order of the original program. This reversal\ngives rise to a non-local program transformation where the beginning of the original program interacts\nwith the generated derivative program.\nTwo methods of implementing reverse-mode AD are commonly distinguished: operator overloading\n(OO) and source code transformation (SCT). In the OO approach primitives are overloaded so that\nat runtime each numerical operation is logged onto a tape (a linear trace) along with its inputs. The\nchain rule can then be evaluated by walking this tape backward. SCT, on the other hand, explicitly\ntransforms the original program (primal) prior to execution to produce a separate derivative function\n(adjoint) whose control \ufb02ow is the reverse of the original program. Both approaches have different\nimplementation, performance, and usability trade-offs [5].\nOO is easier to implement and since it only requires tracing, it naturally supports all the features of\nthe host language such as higher-order functions, recursion, and classes. If the control \ufb02ow of the\nprogram is data dependent, the function must be retraced for each function call, which can cause\nsigni\ufb01cant overhead when the runtime of the primitives is small compared to the cost of tracing.\nSince the adjoint program is run by a separate \u2018derivative interpreter\u2019 (the algorithm that walks the\ntape in reverse), there is no adjoint program that can be inspected, optimized or compiled.\nSCT is harder to implement, since it requires tooling to transform intermediate representations of\ncomputer programs. Further, the AD tool must explicitly support all of the features of the host\nlanguage, including function calls, loops, classes, etc. If a language feature is not explicitly handled\nby the AD system, the user cannot take derivatives of code using those features. 
For some languages\nlike C and C++ this requires a separate toolchain, but re\ufb02ective languages such as Lisp and Python\ncontain the necessary tools to capture, transform, and output program representations. The advantage\nof SCT is that there is no runtime overhead, and that generated derivative code can be statically\nanalyzed and optimized.\n\n3 Prior work\n\nAD packages using either approach have long existed for, e.g., C, C++, Fortran, (see [3] for an\noverview) and have been used in \ufb01elds such as computational \ufb02uid dynamics, atmospheric sciences,\nand astronomy. In the machine learning community different needs have led to the development of a\nseparate set of tools. In particular, the community has a strong attachment to Python and its models\nrely heavily on multidimensional arrays.\nTheano [2] and TensorFlow [1] are two popular machine learning frameworks with support for SCT\nAD. Although Python-based, they do not perform AD on the Python code. Instead, Python is used\nas a metaprogramming language to de\ufb01ne a data\ufb02ow graph (computation graph) on which SCT is\nperformed. Since these data\ufb02ow graphs only operate on immutable values and do not have function\ncalls or lexical scoping, the AD logic is simpli\ufb01ed. The same graph representation is then used for\nstatic analysis, optimizations, and code generation.\nOO has been used to implement AD in Python in packages such as Autograd [16], Chainer [24], and\nPyTorch [21].\nAlthough OO frameworks are easier to implement, their runtime performance falls short of that of\nframeworks using SCT for workloads that do not spend most of their time in hand-optimized compute\n\n2\n\n\fprimitives. On the other hand, existing frameworks that use SCT require the user to metaprogram\ncomputation graphs, signi\ufb01cantly complicating the de\ufb01nition of ML models. 
Tangent applies SCT directly on the Python language in order to combine the performance achieved by SCT with the usability of programming directly in Python.

4 Features

Tangent supports reverse mode and forward mode, as well as function calls, loops, and conditionals. Higher-order derivatives are supported, and reverse and forward mode can readily be combined. To our knowledge, Tangent is the first SCT-based AD system for Python and, moreover, the first SCT-based AD system for a dynamically typed language. As a consequence of performing SCT directly on the Python source code, the generated programs can be run, inspected, profiled, and debugged with standard Python tools. Tangent supports array programming on both CPU and GPU through the NumPy [19] and TensorFlow Eager libraries. A modular design makes it possible to extend Tangent to support other numeric libraries.
The ability to write code directly in Python makes Tangent less verbose and more idiomatic than the metaprogramming approach used by Theano and TensorFlow (see Listing 1a). Moreover, the metaprogrammed code requires a separate compiler and/or runtime, separate debugging tools, etc.

x = tf.placeholder(tf.float32)
y = x * x
dx, = tf.gradients(y, x)

with tf.Session() as sess:
    dx_ = sess.run(dx, feed_dict={x: 3})

(a) TensorFlow requires the programmer to define the variable x as part of the dataflow graph. After the program (dataflow graph) has been constructed, its evaluation must be triggered by creating a session and providing values for the arguments.

def f(x):
    return x * x

df = grad(f)
dx = df(3)

(b) Tangent and libraries such as Autograd allow the user to write pure Python.

Listing 1: Comparison between metaprogramming and direct programming approaches.

The OO approach can be problematic for debugging and usability as well as performance (see Listing 2). 
When an adjoint function grad(f) is called, the function f is executed with non-standard semantics, since each function and operator has been overloaded to log onto a tape, after which the tape is walked in reverse using a loop that is internal to the framework. This means that each function call incurs tracing overhead, and errors that occur during execution will potentially have tracebacks involving tracing logic that can be hard for a user to decipher.

def f(x):
    while x < 10000:
        x = x + 1
    return x

Listing 2: In the case that x is a scalar, this trivial program and its derivative contain a tight loop. Since it does not require tracing, Tangent's derivative of this function is approximately 30% faster than PyTorch's, even though PyTorch is given type information about x whereas Tangent's derivative is dynamically typed.

# Generated gradient function
def dfdx(x, by=1.0):
    # Grad of: y = x * x
    _bx = tangent.unbroadcast(by * x, x)
    _bx2 = tangent.unbroadcast(by * x, x)
    bx = _bx
    bx = tangent.add_grad(bx, _bx2)
    return bx

Listing 3: Source code of the gradient of def f(x): return x * x in Tangent. The unbroadcast function is responsible for reversing the broadcasting performed by NumPy when performing element-wise operations on differently-sized multidimensional arrays.

The adjoint code generated by Tangent is regular Python (see Listing 3), which means that it can be debugged using standard debuggers such as pdb, profiled using, e.g., line_profiler, or optimized by JIT compilers such as Numba [14] and Pythran [11]. 
The adjoint code can readily be inspected by users, and Tangent tries to ensure that it is human-readable and commented, which is useful for debugging as well as for didactic purposes.
Unlike most existing ML frameworks, arrays in Tangent are mutable without incurring unnecessary performance loss (see Section 5.4 for implementation details).

4.1 Backward pass inlining

Many algorithms use approximations or modifications of the gradient. For example, for performance reasons recurrent neural networks (RNNs) are often trained using truncated backpropagation through time [26] (TBPTT) and/or gradient clipping [20]. In other cases, custom gradients are used to train models with discontinuous functions (e.g. straight-through estimators) or for many other applications [4, 9, 25, 12, 13, 18, 15]. A user might also be interested in accessing the values of gradients for logging or debugging.
Existing AD frameworks support this functionality by allowing the user to define custom adjoints for functions. Tangent provides this functionality as well, but uses Python's context manager syntax to introduce a second, novel way of allowing the user to inject arbitrary code into the gradient computation (see Listing 4). We believe this syntax provides a more succinct and readable way of modifying the adjoint code in many cases.

# Original function
def f(x):
    with insert_grad_of(x) as dx:
        if dx > 10:
            print('Clipping', dx)
            dx = 10
    return x * x

# Generated gradient function
def dfdx(x, bx_times_x=1.0):
    x_times_x = x * x
    # Grad of: dx = 10
    _bx = tangent.unbroadcast(bx_times_x * x, x)
    _bx2 = tangent.unbroadcast(bx_times_x * x, x)
    bx = _bx
    bx = tangent.add_grad(bx, _bx2)
    # Inserted code
    if bx > 10:
        print('Clipping', bx)
        bx = 10
    return bx

Listing 4: Gradient clipping implemented using Tangent. 
The code inside of the context manager is inserted directly into the derivative function.

5 Implementation

Tangent uses Python's built-in machinery to inspect and transform the abstract syntax tree (AST) of parsed source code. AD can be performed line by line [10, Proposition 4.2]. Hence, for each piece of supported Python syntax we have implemented a rule indicating how to rewrite an AST node into its primal and adjoint. We have defined adjoints for e.g. mathematical operators, function calls to NumPy methods, and constructs such as if-statements and for-loops. The adjoints are defined using a custom template programming syntax (see Listing 5), which makes it easy for users to add new or custom derivatives.

# Templates are Python functions
@adjoint(numpy.multiply)
def adjoint_multiply(z, x, y):
    d[x] = y * d[z]
    d[y] = x * d[z]

# If the primal contains...
c = numpy.multiply(a, b)

# ...Tangent will expand the template...
new_ast = tangent.template.replace(
    adjoint_multiply,
    z='c', x='a', y='b')

# ...generating the following adjoint
b_a = b * b_c
b_b = a * b_c

Listing 5: Tangent's source generation uses templating. The template takes the form of a Python function which is parsed into its AST. The variable names in the AST are substituted and variables for the partial derivatives are constructed, before the AST is inserted into the code of the adjoint function.

Generated derivative code is constructed using the built-in Python AST. The alternative program representations are Python bytecode, which changes across Python versions, and a formatting-aware AST used in the Python 2-to-3 conversion tool, 2to3, which has little tooling and is more cumbersome to use. 
We acquire and manipulate the Python AST with the inspect and ast modules from the standard library, standardize small differences between the Python 2 and Python 3 AST with gast, and use astor to convert ASTs back into readable source code.
To support dynamic typing and array programming while maintaining efficiency, Tangent relies on a novel combination of multiple dispatch, lazy evaluation, persistent data structures, and static optimizations.

5.1 Multiple dispatch

Python is a dynamic language which uses dynamic typing, late binding and operator overloading. These fundamental features of the language make it impossible to determine ahead of time how a statement will be executed, which means it is impossible to determine ahead of time what the adjoint program should be. Instead of enforcing static types (for example by using type annotations and MyPy2), Tangent embraces late binding and generates adjoints that will use the runtime types to determine what derivative computation to execute.
For example, x * y where x and y are scalars at runtime results in a scalar multiplication. However, if either of the two variables is a NumPy ndarray object, the multiplication operator is dispatched to perform broadcasting followed by element-wise multiplication. The adjoint of this operation requires summing over the broadcasted axes. Tangent will generate code that uses type checking to ensure that the correct adjoint calculation is performed based on the runtime types.
Similarly, the initialization and addition of gradients cannot be generated statically. We introduce add_grad and init_grad operators which use multiple dispatch. For example, init_grad(x) will return 0 if x is a scalar, but will return numpy.zeros_like(x) if x is an ndarray.

5.2 Lazy evaluation

A common performance bottleneck in the context of AD and array programming is that initializing the gradient of a large array results in allocating a large zero array. 
When gradients are accumulated\nlater on this large array of zeros is added to a partial gradient, which is effectively a no-op. In general,\nthe gradient initialization and addition might happen in different functions, making it non-trivial to\nstatically optimize this case. To address this issue, Tangent lazily initializes gradients: Instead of\nallocating an array of zeros, Tangent returns a special ZeroGrad object. The add_grad operator\nuses multiple dispatch to return the other argument when either argument is of the type ZeroGrad.\n\n5.3 Static optimizations\n\nWhen constructing the adjoint of a function, some of the code of the forward pass might become dead\ncode. The opportunity for removing unused code only grows when taking higher order derivatives.\nOne of the advantages of SCT is that the resulting code can be optimized by an optimizing compiler\nwhose dead code elimination (DCE) pass would address this problem. However, Python is an\ninterpreted language, and very few optimizations are applied before its execution. For this reason,\nTangent includes a small Python optimizing compiler toolchain which constructs a control-\ufb02ow graph\n(CFG) on which forward data\ufb02ow analysis is performed. Tangent uses this to perform dead code\nelimination on generated adjoints. The same machinery is used to perform algebraic simpli\ufb01cations\nand constant propagation. Note that although these optimizations are hard to perform on Python in\ngeneral, we can exploit the fact that Tangent operates on a more limited subset of Python which is\nmore amenable to analysis (see Section 6 for details).\nNote that these optimizations are aimed at removing dead code or simplifying trivial expressions\n(such as multiplication by 1) generated by the AD algorithm. Unlike frameworks such as XLA and\nTVM [7], we expressly do not attempt to optimize the numerical kernels themselves. 
Since Tangent outputs regular Python code, functions can be passed to an optimizing Python compiler such as Numba for this purpose.

2 http://mypy-lang.org/

# Optimized generated code
def dfdx(x, by=1.0):
    y = x
    # Grad of: y = x
    _bx = tangent.copy(by)
    bx = _bx
    return bx

# Raw generated code
def dfdx(x, by=1.0):
    # Initialize the tape
    _stack = tangent.Stack()
    y = None
    # Beginning of forward pass
    tangent.push(_stack, y, '_19429e9f')
    y = x
    # Beginning of backward pass
    _y = y
    # Grad of: y = x
    y = tangent.pop(_stack, '_19429e9f')
    _bx = tangent.copy(by)
    by = tangent.init_grad(y)
    bx = _bx
    return bx

Listing 6: A simple example of Tangent's optimization capabilities as applied to the gradient function of def f(x): y = x; return y. Note that the original transformation includes the writing and reading of y to and from the tape, and contains dead code in initializing the gradient of y, which is never returned. Tangent's dataflow analysis is able to match the tape reads and writes and understands that the value of y is the same, allowing it to aggressively optimize the function.

A central problem in reverse mode AD is that intermediate values are required to be kept alive after they go out of scope, since they might be needed by their adjoint. For example, if a function contains z = x * y, the variables x and y cannot be deleted after the function returns, since the backward pass requires their values to calculate dx = dz * y and dy = dz * x. Tangent, like most SCT frameworks, uses a global stack (tape) to store intermediate variables in order to ensure they are kept alive. Hence, before the function returns, x and y are pushed onto this stack and they will be popped off the stack right before the adjoint calculation. 
Note that the trace used in OO is also\nreferred to as a tape, the difference being that the tape in OO stores not only the intermediate variables,\nbut also the operations performed.\nIn order to perform DCE effectively on the generated code, our data\ufb02ow analysis follows variables\nuses through their respective pushes (reads) and pops (writes) in the primal and adjoint code. This\nhighlights the close interaction required between the optimizing compiler and the AD machinery\nfor maximum performance. To enable the data\ufb02ow analysis to match reads and writes they are\naugmented in the source code with unique hashes (see Listing 6).\n\n5.4 Persistent data structures\n\n# Create handle to original version\nx_copy = copy.copy(x)\n# Create new node y\nx[i] = v\n# Restore original version\nprint(x_copy)\n# Modify old version to create z\nx_copy[i] = v\n\nListing 7: Illustration of the persistent array data structure. Root nodes are gray and edges represent\ndeltas.\n\nAD is problematic in the context of mutability. If x and y from the previous example are mutable\narrays, their value could have been changed by an in-place operation, resulting in an incorrect adjoint\ncalculation. For this reason, arrays are in principle immutable in existing AD frameworks for ML such\nas TensorFlow, Autograd, and Theano. PyTorch allows users to mutate arrays if they can guarantee\nthat the previous version will not be needed by the backward pass, otherwise an error will be thrown.\nThis makes algorithms which rely on mutating arrays in place inef\ufb01cient and dif\ufb01cult to express.\nPersistent data structures [8] are data structures that are effectively immutable: They are mutable\ndata structures where all previous versions can be accessed. Unlike truly immutable data structures,\n\n6\n\nxxyxyxzy\fdifferent versions of persistent data structures may share memory and hence can be more memory-\nef\ufb01cient, although accessing previous versions might require extra work. 
Functional languages often\nuse persistent data structures for implementing, e.g., linked lists, trees, stacks. We note that persistent\narray data structures can be used to support mutability ef\ufb01ciently in the context of AD.\nBy default, Tangent handles index assignments (x[i] = y) ef\ufb01ciently by copying only the affected\nsubarray x[i] onto the tape. To deal with mutability in full generality Tangent also introduces a\npersistent array data structure with support for index assignment as well as inserting and deleting\nrows at the end. Each time the array is modi\ufb01ed, the delta with the previous version is stored. Since\nprevious versions can be modi\ufb01ed as well, this results in a version tree where the root contains the\ncurrent array in memory and other versions of the array are represented by leaf nodes (see Listing 7).\nIf the user attempts to read a speci\ufb01c version of the array, the deltas on the path from the leaf to the\nroot of the version tree are applied in order to reconstruct the array in memory. When the handle to a\nspeci\ufb01c array version is destroyed, the deltas are garbage collected. We note that in the context of\nreverse mode AD the most common access pattern is a linear series of mutations during the forward\npass, followed by accessing the arrays in reverse order during the backward pass. In this case, our\npersistent array results in optimal memory and time complexity.\nAs an example, consider the double loop from Listing 8, which is a simpli\ufb01cation of a neural lattice\nlanguage model from [6]. Given an outer loop with n iterations, an inner loop with m iterations, and a\nvector dimensionality of d, the complexity of this algorithm is O(n2md) for immutable arrays. When\nusing regular NumPy arrays, Tangent will intelligently handle index assignments and only copy the\naffected subarray onto the tape, bringing the complexity down to O(n2d + ndm). When a persistent\narray is used, the complexity goes down to O(ndm). 
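The delta-sharing idea can be sketched as follows. Note that this toy keeps the base data at the root and hangs deltas toward the leaves, whereas Tangent's structure inverts this so that the current version stays materialized at the root; all names here are illustrative, not Tangent's API:

```python
# Toy persistent array: each new version stores only a delta against its
# parent, so versions share memory instead of copying the whole array.
class PersistentArray:
    def __init__(self, data=None, parent=None, delta=None):
        self._data = list(data) if parent is None else None
        self._parent = parent
        self._delta = delta  # (index, value) pair

    def set(self, i, v):
        # O(1): record only the changed element; everything else is shared.
        return PersistentArray(parent=self, delta=(i, v))

    def to_list(self):
        # Replay the deltas on the path back to the root version.
        deltas, node = [], self
        while node._parent is not None:
            deltas.append(node._delta)
            node = node._parent
        data = list(node._data)
        for i, v in reversed(deltas):
            data[i] = v
        return data

v0 = PersistentArray([0, 0, 0])
v1 = v0.set(1, 7)  # new version; v0 is untouched
v2 = v1.set(2, 9)
v3 = v0.set(0, 5)  # branch off an old version: a version tree
print(v0.to_list(), v2.to_list(), v3.to_list())
```

Reading a version costs one delta replay per ancestor, which matches the paper's observation that the forward-then-reverse access pattern of reverse mode AD keeps this cost optimal.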
When using persistent arrays, Tangent's runtime and memory complexity is determined only by the amount of data that is inserted, deleted or modified. In contrast, most libraries will have the gradient's runtime and memory complexity grow linearly with the number of times an array is modified. The technique described in [22] for memory-augmented networks is also a special case of using persistent arrays.

[Figure: runtime (s) versus outer loop length (100-900 iterations), comparing immutable arrays, a persistent array, and storing only the affected subarray.]

def f(x, OUTER):
    r = numpy.zeros(DIM)
    for _ in range(OUTER):
        x = append(x, r)
        for _ in range(INNER):
            y = numpy.add(x[-1], 1.)
            x = setitem(x, -1, y)
    return numpy.mean(x)

Listing 8: Runtime for a simplified version of a lattice language model with dimension 2000 and inner loop of 15 iterations. Results are an average of 10 runs.

6 Limitations

SCT relies on the ability to perform dataflow analysis to determine which variables are 'active', i.e. which variables affect the output of the function whose derivative we are constructing. To this end, Tangent is restricted to a subset of Python where these analyses are feasible. Note that these restrictions only apply to statements involving active variables.

1. Functions that modify a variable in-place must also return that variable. Hence, numpy.add(a, b, out=a) is disallowed and should be written as a = numpy.add(a, b). Likewise, a user-defined function that modifies x in-place using x[i] = v must have x as a returned value.

2. 
Closures are not supported since closures with free variable references lead to a problem\nsometimes referred to as \u2018perturbation confusion\u2019 [23], which is non-trivial to address.\nAdditionally, Python uses lexical, not dynamic scoping, so writing adjoint values into the\nsame scope where primal values are read is not straightforward.\n\n7\n\n\f3. Object methods are not currently supported because it is non-obvious what the partial\n\nderivative with respect to a member variable is.\n\n4. In order to perform AD, the function and its source code must be resolvable at the time\nthat the AD transformation is applied. This means that higher-order functions and nested\nfunction de\ufb01nitions are not supported. Tangent could apply additional AD passes at runtime\nto avoid this limitation.\n\n5. Some Python syntax is not (yet) supported3 e.g. try and except statements, as well as\n\nbreak and continue.\n\nIf the return value of a function is not used, it is assumed that its inputs were unchanged. This\nallows statements such as print(numpy.mean(x)) to be used without interfering with the AD\ntransformation.\n\n7 Performance\n\nTangent was not designed with raw performance in mind. Instead, it intends to strike a balance\nbetween usability and good software design practices, while exploring the feasibility and implemen-\ntation details of applying SCT to dynamically typed languages. 
That said, Tangent's lack of runtime overhead combined with static optimizations and lazy gradient initialization means that its runtime performance is competitive with existing frameworks (see Listing 9).

def logsumexp(x):
    return numpy.log(numpy.sum(numpy.exp(x), axis=-1, keepdims=True))

def logsoftmax(logits):
    return logits - logsumexp(logits)

def softmax_xent(logits, y):
    return -numpy.sum(logsoftmax(logits) * y, axis=-1)

def mlp(x, w1, b1, wout, bout, label):
    h1 = numpy.tanh(numpy.dot(x, w1) + b1)
    out = numpy.dot(h1, wout) + bout
    loss = numpy.mean(softmax_xent(out, label))
    return loss

autograd_dmlp = autograd.multigrad(mlp, argnums=(1, 2, 3, 4))
tangent_dmlp = tangent.grad(mlp, wrt=(1, 2, 3, 4))

[Figure: seconds per batch (log scale, 10^-4 to 10^-1) versus model size (2^6 to 2^13) for Tangent, Autograd, and TensorFlow.]

Listing 9: Runtime for a simple feedforward neural network with a single hidden layer. We vary the input size and hidden layer size, which are set to the same value. The reported runtime is averaged over 50 runs with a batch size of 16. Run on a Xeon E5-1650 v3 @ 3.5 GHz, 64GB of RAM, with Ubuntu 14.04 on Python 2.7 with MKL. Note that for sufficiently large models the runtime of the numerical kernels dominates, which means that the frameworks have similar runtimes irrespective of their AD implementation.

8 Conclusion

In this work we introduced the AD library Tangent. Tangent is the first application of source-code transformation on a dynamically typed language such as Python. It uses several novel approaches, such as persistent data structures and lazy evaluation, to ensure good performance. Machine learning models are natural and easy to express and debug in Tangent using many features that are not available in other frameworks, e.g. 
mutable arrays, inspectable derivative code, and modifying gradients by injecting arbitrary code in the backward pass.

3 For an up-to-date overview of supported AST nodes, please refer to the code in tangent/fence.py.

We believe Tangent is an important step on the path to fully general differentiable programming. Rather than an ML framework, Tangent can be seen as the addition of the gradient operator to the Python language, without the need for metaprogramming or separate derivative interpreters (OO). This means that the user can write normal Python code while the entire Python ecosystem, including debuggers, profilers, and introspection capabilities, becomes part of the ML toolkit. This allows users to express models more naturally and debug them more easily.

References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.

[2] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 472:473, 2016.

[3] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.

[4] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[5] Christian H Bischof and H Martin Bücker. Computing derivatives of computer programs. Technical report, Argonne National Lab., IL (US), 2000.

[6] Jacob Buckman and Graham Neubig. 
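The claim that SCT yields inspectable, debuggable derivative code can be made concrete with a small sketch. The following is plain Python written by us for illustration (it is not Tangent's actual generated output, and the names `f` and `dfdx` are ours): it shows the general style of reverse-mode derivative code a source-code transformation emits, namely an ordinary function containing the forward pass followed by adjoint statements in reverse order.

```python
# Primal function: f(x) = x*x + x, so f'(x) = 2x + 1.
def f(x):
    y = x * x
    z = y + x
    return z

# A hand-derived reverse-mode derivative in the style an SCT tool would
# emit: recompute the forward pass, then accumulate adjoints for each
# primal statement in reverse order. bz is the incoming output adjoint.
def dfdx(x, bz=1.0):
    # Forward pass (primal values needed by the adjoints).
    y = x * x
    # Backward pass for z = y + x: bz flows to both by and bx.
    by = bz
    bx = bz
    # Backward pass for y = x * x: contributes 2*x*by to bx.
    bx = bx + 2.0 * x * by
    return bx
```

Because `dfdx` is ordinary source code rather than a taped graph, a user can set breakpoints in it, profile it, or read it line by line, which is the separation of concerns the conclusion above describes.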