{"title": "Learning Libraries of Subroutines for Neurally\u2013Guided Bayesian Program Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 7805, "page_last": 7815, "abstract": "Successful approaches to program induction require a hand-engineered\n  domain-specific language (DSL), constraining the space of allowed\n  programs and imparting prior knowledge of the domain.  We contribute\n  a program induction algorithm that learns a DSL while\n  jointly training a neural network to efficiently search for programs\n  in the learned DSL.  We use our model to synthesize functions on lists,\n  edit text, and solve symbolic regression problems, showing how the\n  model learns a domain-specific library of program components for\n  expressing solutions to problems in the domain.", "full_text": "Library Learning for Neurally-Guided Bayesian\n\nProgram Induction\n\nKevin Ellis\n\nMIT\n\nLucas Morales\n\nMIT\n\nellisk@mit.edu\n\nlucasem@mit.edu\n\nMathias Sabl\u00e9-Meyer\n\nENS Paris-Saclay\nmathsm@mit.edu\n\nArmando Solar-Lezama\n\nMIT\n\nJoshua B. Tenenbaum\n\nMIT\n\nasolar@csail.mit.edu\n\njbt@mit.edu\n\nAbstract\n\nSuccessful approaches to program induction require a hand-engineered domain-\nspeci\ufb01c language (DSL), constraining the space of allowed programs and imparting\nprior knowledge of the domain. We contribute a program induction algorithm\ncalled EC2 that learns a DSL while jointly training a neural network to ef\ufb01ciently\nsearch for programs in the learned DSL. We use our model to synthesize functions\non lists, edit text, and solve symbolic regression problems, showing how the model\nlearns a domain-speci\ufb01c library of program components for expressing solutions to\nproblems in the domain.\n\n1\n\nIntroduction\n\nMuch of everyday human thinking and learning can be understood in terms of program induction:\nconstructing a procedure that maps inputs to desired outputs, based on observing example input-\noutput pairs. 
People can induce programs flexibly across many different domains, and remarkably,\noften from just one or a few examples. For instance, if shown that a text-editing program should map\n\u201cJane Morris Goodall\u201d to \u201cJ. M. Goodall\u201d, we can guess it maps \u201cRichard Erskine Leakey\u201d to \u201cR. E.\nLeakey\u201d; if instead the first input mapped to \u201cDr. Jane\u201d, \u201cGoodall, Jane\u201d, or \u201cMorris\u201d, we might have\nguessed the latter should map to \u201cDr. Richard\u201d, \u201cLeakey, Richard\u201d, or \u201cErskine\u201d, respectively.\nThe FlashFill system [1] developed by Microsoft researchers and now embedded in Excel solves\nproblems such as these and is probably the best-known practical program-induction algorithm, but\nresearchers in programming languages and AI have built successful program induction algorithms\nfor many applications, such as handwriting recognition and generation [2], procedural graphics [3],\ncognitive modeling [4], question answering [5] and robot motion planning [6], to name just a few.\nThese systems work in different ways, but most hinge upon having a carefully engineered\nDomain-Specific Language (DSL). This is especially true for systems such as FlashFill that aim to induce\na wide range of programs very quickly, in a few seconds or less. DSLs constrain the search over\nprograms with strong prior knowledge in the form of a restricted set of programming primitives tuned\nto the needs of the domain: for text editing, these are operations like appending strings and splitting\non characters.\nIn this work, we consider the problem of building agents that learn to solve program induction tasks,\nand also the problem of acquiring the prior knowledge necessary to quickly solve these tasks in a new\ndomain. Representative problems in three domains are shown in Table 1. 
Our solution is an algorithm\nthat grows or bootstraps a DSL while jointly training a neural network to help write programs in the\nincreasingly rich DSL.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nPrograms & Tasks\n\nList Functions:\n[7 2 3]\u2192[7 3]; [1 2 3 4]\u2192[3 4]; [4 3 2 1]\u2192[4 3]\n  f(\u2113) = (f0 \u2113 (\u03bb (x) (> x 2)))\n[2 7 8 1]\u21928; [3 19 14]\u219219\n  f(\u2113) = (f1 \u2113)\n[7 3]\u2192False; [3]\u2192False; [9 0 0]\u2192True; [0]\u2192True; [0 7 3]\u2192True\n  f(\u2113) = (f2 \u2113 0)\n\nText Editing:\n+106 769-438\u2192106.769.438; +83 973-831\u219283.973.831\n  f(s) = (f0 \".\" \"-\" (f0 \".\" \" \" (cdr s)))\nTemple Anna H\u2192TAH; Lara Gregori\u2192LG\n  f(s) = (f2 s)\n\nSymbolic Regression:\n  f(x) = (f1 x); f(x) = (f6 x); f(x) = (f4 x); f(x) = (f3 x)\n\nDSL\n\nList Functions:\nf0(\u2113,p) = (foldr \u2113 nil (\u03bb (x a) (if (p x) (cons x a) a)))\n  (f0: Higher-order filter function)\nf1(\u2113) = (foldr \u2113 0 (\u03bb (x a) (if (> a x) a x)))\n  (f1: Maximum element in list \u2113)\nf2(\u2113,k) = (foldr \u2113 (is-nil \u2113) (\u03bb (x a) (if a a (= k x))))\n  (f2: Whether \u2113 contains k)\n\nText Editing:\nf0(s,a,b) = (map (\u03bb (x) (if (= x a) b x)) s)\n  (f0: Performs character substitution)\nf1(s,c) = (foldr s s (\u03bb (x a) (cdr (if (= c x) s a))))\n  (f1: Drop characters from s until c reached)\nf2(s) = (unfold s is-nil car (\u03bb (z) (f1 z \" \")))\n  (f2: Abbreviates a sequence of words)\nf3(a,b) = (foldr a b cons)\n  (f3: Concatenate strings a and b)\n\nSymbolic Regression:\nf0(x) = (+ x real)\nf1(x) = (f0 (* real x))\nf2(x) = (f1 (* x (f0 x)))\nf3(x) = (f0 (* x (f2 x)))\nf4(x) = (f0 (* x (f3 x)))\n  (f4: 4th order polynomial)\nf5(x) = (/ real x)\nf6(x) = (f4 (f0 x))\n  (f6: rational function)\n\nTable 1: Top: Tasks from each domain, each followed by the programs EC2 discovers for them.\nBottom: Several 
examples from learned DSL. Notice that learned DSL primitives can call each other,\nand that EC2 rediscovers higher-order functions like filter (f0 in List Functions)\n\nBecause any computable learning problem can in principle be cast as program induction, it is important\nto delimit our focus. In contrast to computer assisted programming [7] or genetic programming [8],\nour goal is not to automate software engineering, to learn to synthesize large bodies of code, or to\nlearn complex programs starting from scratch. Ours is a basic AI goal: capturing the human ability\nto learn to think \ufb02exibly and ef\ufb01ciently in new domains \u2014 to learn what you need to know about a\ndomain so you don\u2019t have to solve new problems starting from scratch. We are focused on the kinds\nof problems that humans can solve relatively quickly, once they acquire the relevant domain expertise.\nThese correspond to tasks solved by short programs \u2014 if you have an expressive DSL. Even with a\ngood DSL, program search may be intractable; so we amortize the cost of search by training a neural\nnetwork to assist the search procedure.\nOur algorithm takes inspiration from several ways that skilled human programmers have learned to\ncode: skilled coders build libraries of reusable subroutines that are shared across related programming\ntasks, and can be composed to generate increasingly complex and powerful subroutines. In text\nediting, a good library should support routines for splitting on characters, but also specialize these\nroutines to split on particular characters such as spaces or commas that are frequently used to delimit\nsubstrings across tasks. Skilled coders also learn to recognize what kinds of programming idioms\nand library routines would be useful for solving the task at hand, even if they cannot instantly work\nout the details. 
In text editing, one might learn that if outputs are consistently shorter than inputs,\nremoving characters is likely to be part of the solution; if every output contains a constant substring\n(e.g., \u201cDr.\u201d), inserting or appending that constant string is likely to be a subroutine.\nOur EC2 (ECC, for Explore/Compress/Compile) algorithm incorporates these insights by iterating\nthrough three steps. The Explore step takes a given set of tasks, typically several hundred, and\nexplores the space of programs, searching for compact programs that solve these tasks, guided by the\ncurrent DSL and neural network. The Compress step grows the library (or DSL) of domain-specific\nsubroutines that allow the agent to more compactly write programs in the domain; it modifies\nthe structure of the DSL by discovering regularities across programs found during the Explore step,\ncompressing them to distill out common code fragments across successful programs. The Compile\nstep improves the search procedure by training a neural network to write programs in the current\nDSL, in the spirit of \u201camortized\u201d or \u201ccompiled\u201d inference [9, 10].\nThe learned DSL effectively encodes a prior on programs likely to solve tasks in the domain, while\nthe neural net looks at the example input-output pairs for a specific task and produces a \u201cposterior\u201d for\nprograms likely to solve that specific task. The neural network thus functions as a recognition model\nsupporting a form of approximate Bayesian program induction, jointly trained with a generative\nmodel for programs encoded in the DSL, in the spirit of the Helmholtz machine [11]. The recognition\nmodel ensures that searching for programs remains tractable even as the DSL (and hence the search\nspace for programs) expands.\nWe apply EC2 to three domains: list processing; text editing (in the style of FlashFill [1]); and\nsymbolic regression. 
For each of these we initially provide a generic set of programming primitives.\nOur algorithm then constructs its own DSL for expressing solutions in the domain (Tbl. 1).\nPrior work on program learning has largely assumed a \ufb01xed, hand-engineered DSL, both in classic\nsymbolic program learning approaches (e.g., Metagol: [12], FlashFill: [1]), neural approaches (e.g.,\nRobustFill: [13]), and hybrids of neural and symbolic methods (e.g., Neural-guided deductive\nsearch: [14], DeepCoder: [15]). A notable exception is the EC algorithm [16], which also learns\na library of subroutines. We \ufb01nd EC motivating, and go beyond it and other prior work through\nthe following contributions: (1) We show how to learn-to-learn programs in an expressive Lisp-like\nprogramming language, including conditionals, variables, and higher-order recursive functions; (2)\nWe give an algorithm for learning DSLs, built on a formalism known as Fragment Grammars [17];\nand (3) We give a hierarchical Bayesian framing of the problem that allows joint inference of the\nDSL and neural recognition model.\n\n2 The EC2 Algorithm\n\nWe \ufb01rst mathematically describe our 3-step algorithm as an inference procedure for a hierarchical\nBayesian model (Section 2.1), and then describe each step algorithmically in detail (Section 2.2-2.4).\n\n2.1 Hierarchical Bayesian Framing\n\nEC2 takes as input a set of tasks, written X, each of which is a program synthesis problem. It has at\nits disposal a domain-speci\ufb01c likelihood model, written P[x|p], which scores the likelihood of a task\nx \u2208 X given a program p. Its goal is to solve each of the tasks by writing a program, and also to\ninfer a DSL, written D. We equip D with a real-valued weight vector \u03b8, and together (D, \u03b8) de\ufb01ne a\ngenerative model over programs. We frame our goal as maximum a posteriori (MAP) inference of\n(D, \u03b8) given X. 
Writing J for the joint probability of (D, \u03b8) and X, we want the D\u2217 and \u03b8\u2217 solving:\n\nJ(D, \\theta) \\triangleq P[D, \\theta] \\prod_{x \\in X} \\sum_{p} P[x|p] \\, P[p|D, \\theta]\n\nD^* = \\arg\\max_{D} \\int J(D, \\theta) \\, d\\theta \\qquad \\theta^* = \\arg\\max_{\\theta} J(D^*, \\theta) \\qquad (1)\n\nThe above equations summarize the problem from the point of view of an ideal Bayesian learner.\nHowever, Eq. 1 is wildly intractable because evaluating J(D, \u03b8) involves summing over the infinite\nset of all programs. In practice we will only ever be able to sum over a finite set of programs. So, for\neach task, we define a finite set of programs, called a frontier, and only marginalize over the frontiers:\nDefinition. A frontier of task x, written Fx, is a finite set of programs s.t. P[x|p] > 0 for all p \u2208 Fx.\nUsing the frontiers we define the following intuitive lower bound on the joint probability, called L:\n\nJ \\geq L \\triangleq P[D, \\theta] \\prod_{x \\in X} \\sum_{p \\in F_x} P[x|p] \\, P[p|D, \\theta] \\qquad (2)\n\nEC2 does approximate MAP inference by maximizing this lower bound on the joint probability,\nalternating maximization w.r.t. the frontiers (Explore) and the DSL (Compress):\nExplore: Maxing L w.r.t. the frontiers. Here (D, \u03b8) is fixed and we want to find new programs\nto add to the frontiers so that L increases the most. L most increases by finding programs where\nP[x|p]P[p|D, \u03b8] is large.\nCompress: Maxing \u222b L d\u03b8 w.r.t. the DSL. Here {Fx}x\u2208X is held fixed, and so we can evaluate\nL. Now the problem is that of searching the discrete space of DSLs and finding one maximizing\n\u222b L d\u03b8. Once we have a DSL D we can update \u03b8 to arg max_\u03b8 L(D, \u03b8, {Fx}).\nSearching for programs is hard because of the large combinatorial search space. 
We ease this difficulty\nby training a neural recognition model, q(\u00b7|\u00b7), during the Compile step: q is trained to approximate\nthe posterior over programs, q(p|x) \u2248 P[p|x,D, \u03b8] \u221d P[x|p]P[p|D, \u03b8], thus amortizing the cost of\nfinding programs with high posterior probability.\nCompile: learning to tractably maximize L w.r.t. the frontiers. Here we train q(p|x) to assign\nhigh probability to programs p where P[x|p]P[p|D, \u03b8] is large, because including those programs in\nthe frontiers will most increase L. We train q both on programs found during the Explore step and\non samples from the current DSL.\nCrucially, each of these three steps bootstraps the others (Fig. 1): improving either the DSL or the\nrecognition model makes search easier, so we find more programs solving tasks; both improving the\nDSL and solving more tasks expand the training data for the recognition model; and finding more\nprograms that solve tasks gives more data from which to learn a DSL.\n\nFigure 1: How these steps bootstrap each other. Diagram nodes: DSL: D; Recognition model: q; Search for programs: p. Edge labels: Inductive bias (Explore); Makes tractable (Explore); Trains (Compress); Trains (Compile).\n\n2.2 Explore: Searching for Programs\n\nNow our goal is to search for programs solving the tasks. We use the simple approach of enumerating\nprograms from the DSL in decreasing order of their probability, and then checking if a program p\nassigns positive probability to a task (P[x|p] > 0); if so, we incorporate p into the frontier Fx.\nTo make this concrete we need to define what programs actually are and what form P[p|D, \u03b8] takes.\nWe represent programs as \u03bb-calculus expressions. 
\u03bb-calculus is a formalism for expressing functional\nprograms that closely resembles Lisp, including variables, function application, and the ability to\ncreate new functions. Throughout this paper we will write \u03bb-calculus expressions in Lisp syntax. Our\nprograms are all strongly typed. We use the Hindley-Milner polymorphic typing system [18] which\nis used in functional programming languages like OCaml and Haskell. We now de\ufb01ne DSLs:\nDe\ufb01nition: (D, \u03b8). A DSL D is a set of typed \u03bb-calculus expressions. A weight vector \u03b8 for a DSL\nD is a vector of |D| + 1 real numbers: one number for each DSL element e \u2208 D, written \u03b8e and\ncontrolling the probability of e occurring in a program, and a weight controlling the probability of a\nvariable occurring in a program, \u03b8var.\nTogether with its weight vector, a DSL de\ufb01nes a distribution over programs, P[p|D, \u03b8]. In the\nsupplement, we de\ufb01ne this distribution by specifying a procedure for drawing samples from P[p|D, \u03b8].\nWhy enumerate, when the program synthesis community has invented many sophisticated algorithms\nthat search for programs? [7, 19, 20, 21, 22, 23]. We have two reasons: (1) A key point of our work is\nthat learning the DSL, along with a neural recognition model, can make program induction tractable,\neven if the search algorithm is very simple. (2) Enumeration is a general approach that can be applied\nto any program induction problem. Many of these more sophisticated approaches require special\nconditions on the space of programs.\nHowever, a drawback of enumerative search is that we have no ef\ufb01cient means of solving for arbitrary\nconstants that might occur in a program. In Sec. 
4, we will show how to \ufb01nd programs with real-\nvalued constants by automatically differentiating through the program and setting the constants using\ngradient descent.\n\n2.3 Compile: Learning a Neural Recognition Model\n\nThe purpose of training the recognition model is to amortize the cost of searching for programs. It\ndoes this by learning to predict, for each task, programs with high likelihood according to P[x|p]\nwhile also being probable under the prior (D, \u03b8). Concretely, the recognition model q predicts, for\neach task x \u2208 X, a weight vector q(x) = \u03b8(x) \u2208 R|D|+1. Together with the DSL, this de\ufb01nes a\ndistribution over programs, P[p|D, \u03b8 = q(x)]. We abbreviate this distribution as q(p|x). The crucial\naspect of this framing is that the neural network leverages the structure of the learned DSL, so it is not\nresponsible for generating programs wholesale. We share this aspect with DeepCoder [15] and [24].\n\n4\n\n\fHow should we get the data to train q? This is non-obvious because we are considering a weakly\nsupervised setting (i.e., learning only from tasks and not from task/program pairs). One approach is\nto sample programs from the DSL, run them to get their input/outputs, and then train q to predict the\nprogram from the input/outputs. This is like how the wake-sleep algorithm for the Helmholtz machine\ntrains its recognition model during its sleep phase [25]. The advantage of training on samples, or\n\u201cfantasies,\u201d is that we can draw unlimited samples from the DSL, training on a large amount of data.\nAnother approach is to train q on the (program, task) pairs discovered by the Explore step. 
The\nadvantage here is that the training data is much higher quality, because we are training on real tasks.\nDue to these complementary advantages, we train on both these sources of data.\nFormally, q should approximate the true posteriors over programs: minimizing the expected KL-divergence,\n\\mathbb{E}[\\mathrm{KL}(P[p|x, D, \\theta] \\,\\|\\, q(p|x))], equivalently maximizing \\mathbb{E}[\\sum_{p} P[p|x, D, \\theta] \\log q(p|x)],\nwhere the expectation is taken over tasks. Taking this expectation over the empirical distribution of\ntasks trains q on the real data; taking it over samples from the generative model trains q on \u201cfantasies.\u201d\nThe objective for a recognition model (L_RM) combines the fantasy (L_f) and real-data (L_r) objectives,\nL_RM = L_r + L_f:\n\nL_f = \\mathbb{E}_{(p,x) \\sim (D,\\theta)} [\\log q(p|x)] \\qquad L_r = \\mathbb{E}_{x \\sim X} \\left[ \\sum_{p \\in F_x} \\frac{P[x, p|D, \\theta]}{\\sum_{p' \\in F_x} P[x, p'|D, \\theta]} \\log q(p|x) \\right]\n\n2.4 Compress: Learning a Generative Model (a DSL)\n\nThe purpose of the DSL is to offer a set of abstractions that allow an agent to easily express solutions\nto the tasks at hand. Intuitively, we want the algorithm to look at the frontiers and generalize beyond\nthem, both so the DSL can better express the current solutions, and also so that the DSL might expose\nnew abstractions which will later be used to discover more programs. Formally, we want the DSL\nmaximizing \u222b L d\u03b8 (Sec. 2.1). We replace this marginal with an AIC approximation, giving the\nfollowing objective for DSL induction:\n\n\\log P[D] + \\arg\\max_{\\theta} \\sum_{x \\in X} \\log \\sum_{p \\in F_x} P[x|p] \\, P[p|D, \\theta] + \\log P[\\theta|D] - \\|\\theta\\|_0 \\qquad (3)\n\nWe induce a DSL by searching locally through the space of DSLs, proposing small changes to D\nuntil Eq. 3 fails to increase. The search moves work by introducing new \u03bb-expressions into the\nDSL. 
We propose these new expressions by extracting fragments of programs already in the frontiers\n(Fig. 2). An important point here is that we are not simply adding subexpressions of programs to\nD, as done in the EC algorithm [16] and other prior work [26]. Instead, we are extracting fragments\nthat unify with programs in the frontiers. This idea of storing and reusing fragments of expressions\ncomes from Fragment Grammars [17] and Tree-Substitution Grammars [27], and is closely related\nto the idea of antiunification [28, 29]. Care must be taken to ensure that this \u2018fragmenting\u2019 obeys\nvariable scoping rules; Section 4 of the supplement gives an overview of Fragment Grammars and\nhow we adapt them to the lexical scoping rules of \u03bb-calculus. To define the prior distribution over\n(D, \u03b8), we penalize the syntactic complexity of the \u03bb-calculus expressions in the DSL, defining\nP[D] \\propto \\exp(-\\lambda \\sum_{p \\in D} \\mathrm{size}(p)), where size(p) measures the size of the syntax tree of program p,\nand place a symmetric Dirichlet prior over the weight vector \u03b8.\nPutting all these ingredients together, Alg. 1 describes how we combine program search, recognition\nmodel training, and DSL induction. For added robustness, we interleave an extra program search step\n(Explore) before training the recognition model, and just enumerate from the prior (D, \u03b8) during this\nextra Explore step.\n\n3 Programs that manipulate sequences\n\nWe apply EC2 to list processing (Section 3.1) and text editing (Section 3.2). 
For both these domains\nwe use a bidirectional GRU [30] for the recognition model, and initially provide the system with a\ngeneric set of list processing primitives: foldr, unfold, if, map, length, index, =, +, -, 0, 1, cons,\ncar, cdr, nil, and is-nil.\n\nExample programs in frontiers (top):\n(\u03bb (\u2113) (map (\u03bb (x) (index x \u2113)) (range (- (length \u2113) 1))))\n(\u03bb (\u2113) (map (\u03bb (x) (index x \u2113)) (range (+ 1 1))))\nProposed \u03bb-expression (top):\n(map (\u03bb (x) (index x \u2113)) (range \u03b1))\n\nExample programs in frontiers (bottom):\n(\u03bb (s) (map (\u03bb (x) (if (= x '.') '-' x))) s)\n(\u03bb (s) (map (\u03bb (x) (if (= x '-') ',' x))) s)\nProposed \u03bb-expression (bottom):\n(\u03bb (s) (map (\u03bb (x) (if (= x \u03b1) \u03b2 x))) s)\n\nFigure 2: Left: syntax trees of two programs sharing common structure, highlighted in orange,\nfrom which we extract a fragment and add it to the DSL (bottom). Right: actual programs, from\nwhich we extract fragments that (top) slice from the beginning of a list or (bottom) perform character\nsubstitutions.\n\nAlgorithm 1 The EC2 Algorithm\n\nInput: Initial DSL D, set of tasks X, iterations I\nHyperparameters: Enumeration timeout T\nInitialize \u03b8 \u2190 uniform\nfor i = 1 to I do\n  For each task x \u2208 X, set F_x^\u03b8 \u2190 {p | p \u2208 enum(D, \u03b8, T) if P[x|p] > 0}  (Explore)\n  q \u2190 train recognition model, maximizing L_RM (see Sec. 2.3)  (Compile)\n  For each task x \u2208 X, set F_x^q \u2190 {p | p \u2208 enum(D, q(x), T) if P[x|p] > 0}  (Explore)\n  D, \u03b8 \u2190 induceDSL({F_x^\u03b8 \u222a F_x^q}x\u2208X) (see Sec. 2.4)  (Compress)\nend for\nreturn D, \u03b8, q\n\n3.1 List Processing\n\nSynthesizing programs that manipulate data structures is a widely studied problem in the programming\nlanguages community [20]. 
We consider this problem within the context of learning functions that\nmanipulate lists, and which also perform arithmetic operations upon lists of numbers.\nWe created 236 human-interpretable list manipulation tasks, each with 15 input/output examples\n(Tbl. 2). Our data set is interesting in three major ways: many of the tasks require complex solutions;\nthe tasks were not generated from some latent DSL; and the agent must learn to solve these\ncomplicated problems from only 236 tasks. Our data set assumes arithmetic operations as well as\nsequence operations, so we additionally provide our system with the following arithmetic primitives:\nmod, *, >, is-square, is-prime.\nWe evaluated EC2 on a random 50/50 test/train split. Interestingly, we found that the recognition\nmodel provided little benefit for the training tasks. However, it yielded faster search times on held out\ntasks, allowing more tasks to be solved before timing out. The system composed 38 new subroutines,\nyielding a more expressive DSL more closely matching the domain (left of Tbl. 1, right of Fig. 2).\nSee the supplement for a complete list of DSL primitives discovered by EC2.\n\nName | Input | Output\nrepeat-2 | [7 0] | [7 0 7 0]\ndrop-3 | [0 3 8 6 4] | [6 4]\nrotate-2 | [8 14 1 9] | [1 9 8 14]\ncount-head-in-tail | [1 2 1 1 3] | 2\nkeep-mod-5 | [5 9 14 6 3 0] | [5 0]\nproduct | [7 1 6 2] | 84\n\nTable 2: Some tasks in our list function domain. See the\nsupplement for the complete data set.\n\n3.2 Text Editing\n\nSynthesizing programs that edit text is a classic problem in the programming languages and AI\nliteratures [24, 31], and algorithms that learn text editing programs ship in Microsoft Excel [1].\nThis prior work presumes a hand-engineered DSL. 
We show EC2 can instead start out with generic\nsequence manipulation primitives and recover many of the higher-level building blocks that have\nmade these other text editing systems successful.\nBecause our enumerative search procedure cannot generate string constants, we instead enumerate\nprograms with string-valued parameters. For example, to learn a program that prepends \u201cDr.\u201d, we\nenumerate (f3 string s), where f3 is the learned appending primitive (Tbl. 1), and then define\nP[x|p] by approximately marginalizing out the string parameters via a simple dynamic program. In\nSec. 4, we will use a similar trick to synthesize programs containing real numbers, but using gradient\ndescent instead of dynamic programming.\nWe trained our system on a corpus of 109 automatically generated text editing tasks, with 4 input/output\nexamples each. After three iterations, it assembles a DSL containing a dozen new functions (center\nof Tbl. 1) that let it solve all of the training tasks. But how well does the learned DSL generalize to\nreal text-editing scenarios? We tested, but did not train, on the 108 text editing problems from the\nSyGuS [32] program synthesis competition. Before any learning, EC2 solves 3.7% of the problems\nwith an average search time of 235 seconds. After learning, it solves 74.1%, and does so much faster,\nsolving them in an average of 29 seconds. As of the 2017 SyGuS competition, the best-performing\nalgorithm solves 82.4% of the problems. But SyGuS comes with a different hand-engineered DSL\nfor each text editing problem (see footnote 1). Here we learned a single DSL that applied generically to all of the\ntasks, and perform comparably to the best prior work.\n\n4 Symbolic Regression: Programs from visual input\n\nWe apply EC2 to symbolic regression problems. Here, the agent observes points along the curve\nof a function, and must write a program that fits those points. 
We initially equip our learner with\naddition, multiplication, and division, and task it with solving 100 symbolic regression problems, each\neither a polynomial of degree 1\u20134 or a rational function. The recognition model is a convolutional\nnetwork that observes an image of the target function\u2019s graph (Fig. 3) \u2014 visually, different kinds of\npolynomials and rational functions produce different kinds of graphs, and so the recognition model\ncan learn to look at a graph and predict what kind of function best explains it. A key difficulty,\nhowever, is that these problems are best solved with programs containing real numbers. Our solution\nto this difficulty is to enumerate programs with real-valued parameters, and then fit those parameters\nby automatically differentiating through the programs the system writes and running gradient descent.\nWe define the likelihood model, P[x|p], by assuming a Gaussian noise model for\nthe input/output examples, and penalize the use of real-valued parameters using the BIC [33].\nEC2 learns a DSL containing 13 new functions, most of which are templates for polynomials of\ndifferent orders or ratios of polynomials. It also learns to find programs that minimize the number of\ncontinuous degrees of freedom. For example, it learns to represent linear functions with the program\n(* real (+ x real)), which has two continuous degrees of freedom, and represents quartic functions\nusing the invented DSL primitive f4 in the rightmost column of Tbl. 1, which has five continuous\nparameters. This phenomenon arises from our Bayesian framing \u2014 both the implicit bias towards\nshorter programs and the likelihood model\u2019s BIC penalty.\n\nFigure 3: Recognition model input for symbolic regression. 
DSL learns subroutines for polynomials (top row)\nand rational functions (bottom row) while the recog-\nnition model jointly learns to look at a graph of the\nfunction (above) and predict which of those subroutines\nbest explains the observation.\n\n1SyGuS text editing problems also prespecify the set of allowed string constants for each task. For these\n\nexperiments, our system did not use this assistance.\n\n7\n\n\f5 Quantitative Results\n\nWe compare with ablations of our model on held out tasks. The purpose of this ablation study is\nboth to examine the role of each component of EC2, as well as to compare with prior approaches\nin the literature: a head-to-head comparison of program synthesizers is complicated by the fact that\neach system, including ours, makes idiosyncratic assumptions about the space of programs and the\nstatement of tasks. Nevertheless, much prior work can be modeled within our setup. We compare\nwith the following ablations (Tbl 3; Fig 4):\nNo NN: lesions the recognition model.\nNPS, which does not learn the DSL, instead learning the recognition model from samples drawn\nfrom the \ufb01xed DSL. We call this NPS (Neural Program Synthesis) because this is closest to how\nRobustFill [13] and DeepCoder [15] are trained.\nSE, which lesions the recognition model and restricts the DSL learning algorithm to only add\nSubExpressions of programs in the frontiers to the DSL. This is how most prior approaches have\nlearned libraries of functions [16, 34, 26].\nPCFG, which lesions the recognition model and does not learn the DSL, but instead learns the\nparameters of the DSL (\u03b8), learning the parameters of a PCFG while not learning any of the structure.\nEnum, which enumerates a frontier without any learning \u2014 equivalently, our \ufb01rst Explore step.\nWe are interested both in how many tasks\nthe agent can solve and how quickly it can\n\ufb01nd those solutions. Tbl. 3 compares our\nmodel against these alternatives. 
We con-\nsistently improve on the baselines, and also\n\ufb01nd that lesioning the recognition model\nimpairs the convergence of the algorithm,\ncausing it to hit a lower \u2018plateau\u2019 after\nwhich it stops solving new tasks, following\nan initial spurt of learning (Fig. 4) \u2013 with-\nout the neural network, search becomes in-\ntractable. This lowered \u2018plateau\u2019 supports\na view of the recognition model as a way\nof amortizing the cost of search.\n\nTable 3: % held-out test tasks solved. Solve time: aver-\naged over solved tasks.\n\n74% 43% 30% 33% 0% 4%\n235s\n\n94% 79% 71% 35% 62% 37%\n20s\n\n11s 35s\n\n38s 80s\nSymbolic Regression\n\nOurs No NN SE NPS PCFG Enum\n\n% solved\nSolve time 88s\n\n% solved\nSolve time 24s\n\n84% 75% 62% 38% 38% 37%\n29s\n\n28s 31s\n\n40s\n\n55s\n\n39s\nText Editing\n\n% solved\nSolve time 29s\n\n49s\n\nList Processing\n\n44s\n\n\u2013\n\nFigure 4: Learning curves for EC2 both with (in orange) and without (in teal) the recognition model.\nSolid lines: % holdout testing tasks solved. Dashed lines: Average solve time.\n\n6 Related Work\n\nOur work is far from the \ufb01rst for learning to learn programs, an idea that goes back to Solomonoff [35]:\nDeep learning: Much recent work in the ML community has focused on creating neural networks\nthat regress from input/output examples to programs [13, 6, 24, 15]. EC2\u2019s recognition model draws\nheavily from this line of work, particularly from [24]. We see these prior works as operating in\na different regime: typically, they train with strong supervision (i.e., with annotated ground-truth\nprograms) on massive data sets (i.e., hundreds of millions [13]). 
Our work considers a weakly-\nsupervised regime where ground truth programs are not provided and the agent must learn from at\nmost a few hundred tasks, which is facilitated by our \u201cHelmholtz machine\u201d style recognition model.\n\n[Figure 4, panel: Symbolic Regression. x-axis: Iteration; y-axes: % Solved (solid), Solve time (dashed).]\n\nInventing new subroutines for program induction: Several program induction algorithms, most\nprominently the EC algorithm [16], take as their goal to learn new, reusable subroutines that are shared\nin a multitask setting. We \ufb01nd this work inspiring and motivating, and extend it along two dimensions:\n(1) we propose a new algorithm for inducing reusable subroutines, based on Fragment Grammars [17];\nand (2) we show how to combine these techniques with bottom-up neural recognition models. Other\ninstances of this related idea are [34], Schmidhuber\u2019s OOPS model [36], MagicHaskeller [37],\nBayesian program merging [29], and predicate invention in Inductive Logic Programming [26].\nClosely allied ideas have been applied to mining \u2018code idioms\u2019 from programs [38], and, concurrent\nwith this work, using those idioms to better synthesize functional programs from natural language [39].\nBayesian Program Learning: Our work is an instance of Bayesian Program Learning (BPL;\nsee [2, 16, 40, 34, 41]). Previous BPL systems have largely assumed a \ufb01xed DSL (but see [34]), and\nour contribution here is a general way of doing BPL with less hand-engineering of the DSL.\n\n7 Discussion\n\nWe contribute an algorithm, EC2, that learns to program by bootstrapping a DSL with new domain-\nspeci\ufb01c primitives that the algorithm itself discovers, together with a neural recognition model that\nlearns how to ef\ufb01ciently deploy the DSL on new tasks. 
We believe this integration of top-down\nsymbolic representations and bottom-up neural networks \u2014 both of them learned \u2014 helps make\nprogram induction systems more generally useful for AI.\nA feature of our system is that it learns from (and, critically, needs) a corpus of training tasks.\nIs constructing (or curating) corpora of tasks any easier or better than hand-engineering DSLs? In\nthe immediate future, we expect some degree of hand-engineering of DSLs to continue, especially\nin domains where humans have strong intuitions about the underlying system of domain-speci\ufb01c\nconcepts, like text editing. However, if program induction is to become a standard part of the AI\ntoolkit, then, in the long term, we need to build agents that autonomously acquire the knowledge\nneeded to navigate a new domain. So, through the lens of program synthesis, EC2 carries the\nrestriction that it requires a high-quality corpus of training tasks; but, for the program-induction\napproach to AI, this restriction is a feature, not a bug.\nMany directions remain open. Two immediate goals are to integrate more sophisticated neural\nrecognition models [13] and program synthesizers [7], which may improve performance in some\ndomains over the generic methods used here: while our focus in this work was learning to quickly\nwrite small programs, we believe more sophisticated neural models, coupled with more powerful\nprogram search algorithms, could extend our approach to synthesize larger bodies of code. 
Another\ndirection is to explore DSL meta-learning: can we \ufb01nd a single universal primitive set that could\neffectively bootstrap DSLs for new domains, including the three domains considered, but also many\nothers?\n\nAcknowledgments\n\nWe are grateful for collaborations with Eyal Dechter, whose EC algorithm directly inspired this work,\nand for funding from the NSF GRFP, AFOSR award FA9550-16-1-0012, the MIT-IBM Watson AI\nLab, the MUSE program (DARPA grant FA8750-14-2-0242), and an AWS ML Research Award. This\nmaterial is based upon work supported by the Center for Brains, Minds and Machines (CBMM),\nfunded by NSF STC award CCF-1231216.\n\nReferences\n\n[1] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In ACM\n\nSIGPLAN Notices, volume 46, pages 317\u2013330. ACM, 2011.\n\n[2] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through\n\nprobabilistic program induction. Science, 350(6266):1332\u20131338, 2015.\n\n[3] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Joshua B Tenenbaum. Learning to infer graphics\n\nprograms from hand-drawn images. NIPS, 2018.\n\n[4] Ute Schmid and Emanuel Kitzelmann. Inductive rule learning on the knowledge level. Cognitive Systems\n\nResearch, 12(3-4):237\u2013248, 2011.\n\n[5] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross\nGirshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In\nCVPR.\n\n[6] Jacob Devlin, Rudy R Bunel, Rishabh Singh, Matthew Hausknecht, and Pushmeet Kohli. Neural program\n\nmeta-induction. In NIPS, 2017.\n\n[7] Armando Solar-Lezama. Program Synthesis By Sketching. PhD thesis, 2008.\n\n[8] John R. Koza. Genetic programming - on the programming of computers by means of natural selection.\n\nMIT Press, 1993.\n\n[9] Tuan Anh Le, At\u0131l\u0131m G\u00fcne\u015f Baydin, and Frank Wood. 
Inference Compilation and Universal Probabilistic\n\nProgramming. In AISTATS, 2017.\n\n[10] Andreas Stuhlm\u00fcller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. NIPS, 2013.\n\n[11] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The \"wake-sleep\" algorithm for\n\nunsupervised neural networks. Science, 268(5214):1158\u20131161, 1995.\n\n[12] Stephen H Muggleton, Dianhuan Lin, and Alireza Tamaddoni-Nezhad. Meta-interpretive learning of\n\nhigher-order dyadic datalog: Predicate invention revisited. Machine Learning, 100(1):49\u201373, 2015.\n\n[13] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet\n\nKohli. Robust\ufb01ll: Neural program learning under noisy i/o. ICML, 2017.\n\n[14] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani.\n\nNeural-guided deductive search for real-time program synthesis from examples. ICLR, 2018.\n\n[15] Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder:\n\nLearning to write programs. ICLR, 2016.\n\n[16] Eyal Dechter, Jon Malmaud, Ryan P. Adams, and Joshua B. Tenenbaum. Bootstrap learning via modular\n\nconcept discovery. In IJCAI, 2013.\n\n[17] Timothy J. O\u2019Donnell. Productivity and Reuse in Language: A Theory of Linguistic Computation and\n\nStorage. The MIT Press, 2015.\n\n[18] Benjamin C. Pierce. Types and programming languages. MIT Press, 2002.\n\n[19] Eric Schkufza, Rahul Sharma, and Alex Aiken. Stochastic superoptimization. In ACM SIGARCH Computer\n\nArchitecture News, volume 41, pages 305\u2013316. ACM, 2013.\n\n[20] John K Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-\n\noutput examples. In PLDI, 2015.\n\n[21] Peter-Michael Osera and Steve Zdancewic. Type-and-example-directed program synthesis. In ACM\n\nSIGPLAN Notices, volume 50, pages 619\u2013630. 
ACM, 2015.\n\n[22] Oleksandr Polozov and Sumit Gulwani. Flashmeta: A framework for inductive program synthesis. ACM\n\nSIGPLAN Notices, 50(10):107\u2013126, 2015.\n\n[23] Nadia Polikarpova, Ivan Kuraj, and Armando Solar-Lezama. Program synthesis from polymorphic\n\nre\ufb01nement types. ACM SIGPLAN Notices, 51(6):522\u2013538, 2016.\n\n[24] Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Kalai. A machine learning\n\nframework for programming by example. In ICML, pages 187\u2013195, 2013.\n\n[25] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural\n\ncomputation, 7(5):889\u2013904, 1995.\n\n[26] Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B. Tenenbaum, and Stephen Muggleton. Bias reformula-\n\ntion for one-shot function induction. In ECAI 2014, 2014.\n\n[27] Trevor Cohn, Phil Blunsom, and Sharon Goldwater. Inducing tree-substitution grammars. JMLR.\n\n[28] Robert John Henderson. Cumulative learning in the lambda calculus. PhD thesis, Imperial College\n\nLondon, 2013.\n\n[29] Irvin Hwang, Andreas Stuhlm\u00fcller, and Noah D Goodman. Inducing probabilistic programs by bayesian\n\nprogram merging. arXiv preprint arXiv:1110.5667, 2011.\n\n10\n\n\f[30] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger\nSchwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical\nmachine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[31] Tessa Lau. Programming by demonstration: a machine learning approach. PhD thesis, 2001.\n\n[32] Rajeev Alur, Dana Fisman, Rishabh Singh, and Armando Solar-Lezama. Sygus-comp 2016: results and\n\nanalysis. arXiv preprint arXiv:1611.07627, 2016.\n\n[33] Christopher M. Bishop. Pattern Recognition and Machine Learning. 2006.\n\n[34] Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical bayesian approach. In\n\nICML, 2010.\n\n[35] Ray J Solomonoff. 
A system for incremental learning based on algorithmic probability. Sixth Israeli\n\nConference on Arti\ufb01cial Intelligence, Computer Vision and Pattern Recognition, 1989.\n\n[36] J\u00fcrgen Schmidhuber. Optimal ordered problem solver. Machine Learning, 54(3):211\u2013254, 2004.\n\n[37] Susumu Katayama. Towards human-level inductive functional programming. In International Conference\n\non Arti\ufb01cial General Intelligence, pages 111\u2013120. Springer, 2015.\n\n[38] Miltiadis Allamanis and Charles Sutton. Mining idioms from source code. In Proceedings of the 22nd\nACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages\n472\u2013483, New York, NY, USA, 2014. ACM.\n\n[39] Richard Shin, Marc Brockschmidt, Miltiadis Allamanis, and Oleksandr Polozov. Program synthesis with\n\nlearned code idioms. Under review, 2018.\n\n[40] Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. Sampling for Bayesian program learning. In\n\nAdvances in Neural Information Processing Systems, 2016.\n\n[41] Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. Unsupervised learning by program synthesis.\n\nIn NIPS.", "award": [], "sourceid": 4859, "authors": [{"given_name": "Kevin", "family_name": "Ellis", "institution": "MIT"}, {"given_name": "Lucas", "family_name": "Morales", "institution": "MIT"}, {"given_name": "Mathias", "family_name": "Sabl\u00e9-Meyer", "institution": "MIT"}, {"given_name": "Armando", "family_name": "Solar-Lezama", "institution": "MIT"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}