Unsupervised Learning by Program Synthesis

Advances in Neural Information Processing Systems, pages 973-981

Kevin Ellis
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
ellisk@mit.edu

Armando Solar-Lezama
MIT CSAIL
Massachusetts Institute of Technology
asolar@csail.mit.edu

Joshua B. Tenenbaum
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
jbt@mit.edu

Abstract

We introduce an unsupervised learning algorithm that combines probabilistic modeling with solver-based techniques for program synthesis. We apply our techniques to both a visual learning domain and a language learning problem, showing that our algorithm can learn many visual concepts from only a few examples and that it can recover some English inflectional morphology. Taken together, these results give both a new approach to unsupervised learning of symbolic compositional structures, and a technique for applying program synthesis tools to noisy data.

1 Introduction

Unsupervised learning seeks to induce good latent representations of a data set.
Nonparametric statistical approaches such as deep autoencoder networks, mixture-model density estimators, or nonlinear manifold learning algorithms have been very successful at learning representations of high-dimensional perceptual input. However, it is unclear how they would represent more abstract structures such as spatial relations in vision (e.g., inside of or all in a line) [2], or morphological rules in language (e.g., the different inflections of verbs) [1, 13]. Here we give an unsupervised learning algorithm that synthesizes programs from data, with the goal of learning such concepts. Our approach generalizes from small amounts of data, and produces interpretable symbolic representations parameterized by a human-readable programming language.

Programs (deterministic or probabilistic) are a natural knowledge representation for many domains [3], and the idea that inductive learning should be thought of as probabilistic inference over programs is at least 50 years old [6]. Recent work in learning programs has focused on supervised learning from noiseless input/output pairs, or from formal specifications [4]. Our goal here is to learn programs from noisy observations without explicit input/output examples. A central idea in unsupervised learning is compression: finding data representations that require the fewest bits to write down. We realize this by treating observed data as the output of an unknown program applied to unknown inputs. By doing joint inference over the program and the inputs, we recover compressive encodings of the observed data. The induced program gives a generative model for the data, and the induced inputs give an embedding for each data point.

Although a completely domain-general method for program synthesis would be desirable, we believe this will remain intractable for the foreseeable future.
Accordingly, our approach factors out the domain-specific components of problems in the form of a grammar for program hypotheses, and we show how this allows the same general-purpose tools to be used for unsupervised program synthesis in two very different domains. In a domain of visual concepts [5] designed to be natural for humans but difficult for machines to learn, we show that our methods can synthesize simple graphics programs representing these visual concepts from only a few example images. These programs outperform both previous machine-learning baselines and several new baselines we introduce. We also study the domain of learning morphological rules in language, treating rules as programs and inflected verb forms as outputs. We show how to encode prior linguistic knowledge as a grammar over programs and recover human-readable linguistic rules, useful for both simple stemming tasks and for predicting the phonological form of new words.

2 The unsupervised program synthesis algorithm

The space of all programs is vast and often unamenable to the optimization methods used in much of machine learning. We extend two ideas from the program synthesis community to make search over programs tractable:

Sketching: In the sketching approach to program synthesis, one manually provides a sketch of the program to be induced, which specifies a rough outline of its structure [7]. Our sketches take the form of a probabilistic context-free grammar and make explicit the domain-specific prior knowledge.

Symbolic search: Much progress has been made in the engineering of general-purpose symbolic solvers for Satisfiability Modulo Theories (SMT) problems [8]. We show how to translate our sketches into SMT problems. Program synthesis is then reduced to solving an SMT problem.
These are intractable in general, but are often solved efficiently in practice due to the highly constrained nature of program synthesis, which these solvers can exploit.

Prior work on symbolic search from sketches has not had to cope with noisy observations or probabilities over the space of programs and inputs. Demonstrating how to do this efficiently is our main technical contribution.

2.1 Formalization as probabilistic inference

We formalize unsupervised program synthesis as Bayesian inference within the following generative model: Draw a program f(·) from a description length prior over programs, which depends upon the sketch. Draw N inputs {I_i}_{i=1}^N to the program f(·) from a domain-dependent description length prior P_I(·). These inputs are passed to the program to yield {z_i}_{i=1}^N with z_i ≜ f(I_i) (z_i "defined as" f(I_i)). Last, we compute the observed data {x_i}_{i=1}^N by drawing from a noise model P_{x|z}(·|z_i). Our objective is to estimate the unobserved f(·) and {I_i}_{i=1}^N from the observed dataset {x_i}_{i=1}^N. We use this probabilistic model to define the description length below, which we seek to minimize:

  \underbrace{-\log P_f(f)}_{\text{program length}} + \sum_{i=1}^{N} \Big( \underbrace{-\log P_{x|z}(x_i \mid f(I_i))}_{\text{data reconstruction error}} \; \underbrace{-\log P_I(I_i)}_{\text{data encoding length}} \Big)    (1)

2.2 Defining a program space

We sketch a space of allowed programs by writing down a context-free grammar G, and write L to mean the set of all programs generated by G. Placing uniform production probabilities over each non-terminal symbol in G gives a PCFG that serves as a prior over programs: the P_f(·) of Eq.
1. For example, a grammar over arithmetic expressions might contain rules that say "expressions are either the sum of two expressions, or a real number, or an input variable x," which we write as

  E → E + E | R | x    (2)

Having specified a space of programs, we define the meaning of a program in terms of SMT primitives, which can include objects like tuples, real numbers, conditionals, booleans, etc. [8]. We write τ to mean the set of expressions built of SMT primitives. Formally, we assume G comes equipped with a denotation for each rule, which we write as ⟦·⟧ : L → τ → τ. The denotation of a rule in G is always written as a function of the denotations of that rule's children. For example, a denotation for the grammar in Eq. 2 is (where I is a program input):

  ⟦E_1 + E_2⟧(I) = ⟦E_1⟧(I) + ⟦E_2⟧(I)    ⟦r ∈ R⟧(I) = r    ⟦x⟧(I) = I    (3)

Defining the denotations for a grammar is straightforward and analogous to writing a "wrapper library" around the core primitives of the SMT solver. Our formalization factors out the grammar and the denotation, but they are tightly coupled and, in other synthesis tools, written down together [7, 9]. The denotation shows how to construct an SMT expression from a single program in L, and we use it to build an SMT expression that represents the space of all programs, such that its solution tells which program in the space solves the synthesis problem. The SMT solver then solves jointly for the program and its inputs, subject to an upper bound upon the total description length. This builds upon prior work in program synthesis, such as [9], but departs in the quantitative aspect of the constraints and in not knowing the program inputs.
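To make the grammar and denotation concrete, here is a minimal Python analogue of Eqs. 2-3 (our own illustration, not the paper's implementation): programs from E → E + E | R | x are nested tuples, `denote` implements ⟦·⟧ over ordinary numbers rather than SMT terms, and `description_length` is the -log P_f(f) of a uniform PCFG.

```python
import math

# Programs from the grammar E -> E + E | R | x of Eq. 2, as nested tuples:
# ("+", e1, e2), ("lit", r), or ("x",). (This representation is ours.)

def denote(e):
    """The denotation [[.]] of Eq. 3: a program denotes a function of input I."""
    if e[0] == "+":
        f1, f2 = denote(e[1]), denote(e[2])
        return lambda I: f1(I) + f2(I)   # [[E1 + E2]](I) = [[E1]](I) + [[E2]](I)
    if e[0] == "lit":
        return lambda I: e[1]            # [[r]](I) = r
    return lambda I: I                   # [[x]](I) = I

def description_length(e):
    """-log2 Pf(e) under a uniform PCFG: each node chooses one of E's three
    productions, costing log2(3) bits (the bits for the literal r are omitted)."""
    bits = math.log2(3)
    if e[0] == "+":
        bits += description_length(e[1]) + description_length(e[2])
    return bits

prog = ("+", ("x",), ("lit", 2.5))       # the program x + 2.5
print(denote(prog)(1.0))                 # -> 3.5
print(description_length(prog))          # three grammar nodes: 3 * log2(3) bits
```

In the actual system the denotation emits SMT expressions rather than numbers, so the same recursion builds a formula over symbolic programs and inputs.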
Due to space constraints, we only briefly describe the synthesis algorithm, leaving a detailed discussion to the Supplement.

We use Algorithm 1 to generate an SMT formula that (1) defines the space of programs L; (2) computes the description length of a program; and (3) computes the output of a program on a given input. In Algorithm 1 the returned description length l corresponds to the -log P_f(f) term of Eq. 1, while the returned evaluator f(·) gives us the f(I_i) terms. The returned constraints A ensure that the program computed by f(·) is a member of L.

The SMT formula generated by Algorithm 1 must be supplemented with constraints that compute the data reconstruction error and data encoding length of Eq. 1. We handle infinitely recursive grammars by bounding the depth of recursive calls to the Generate procedure, as in [7].

SMT solvers are not designed to minimize loss functions, but to verify the satisfiability of a set of constraints. We minimize Eq. 1 by first asking the solver for any solution, then adding a constraint saying its solution must have smaller description length than the one found previously, and so on until it can find no better solution.

Algorithm 1: SMT encoding of programs generated by production P of grammar G

function Generate(G, ⟦·⟧, P):
  Input: grammar G, denotation ⟦·⟧, non-terminal P
  Output: description length l : τ, evaluator f : τ → τ, assertions A : 2^τ
  choices ← {P → K(P′, P″, ...) ∈ G}
  n ← |choices|
  for r = 1 to n do
    let K(P_r^1, ..., P_r^k) = choices(r)
    for j = 1 to k do
      l_r^j, f_r^j, A_r^j ← Generate(G, ⟦·⟧, P_r^j)
    end for
    l_r ← Σ_j l_r^j
    // Denotation is a function of child denotations; let g_r be that function for choices(r)
    // Q^1, ..., Q^k : L are arguments to constructor K
    let g_r(⟦Q^1⟧(I), ..., ⟦Q^k⟧(I)) = ⟦K(Q^1, ..., Q^k)⟧(I)
    f_r(I) ← g_r(f_r^1(I), ..., f_r^k(I))
  end for
  // Indicator variables specifying which rule is used;
  // fresh variables unused in any existing formula
  c_1, ..., c_n ← FreshBooleanVariable()
  A_1 ← ∨_j c_j
  A_2 ← ∀ j ≠ k : ¬(c_j ∧ c_k)
  A ← A_1 ∪ A_2 ∪ ∪_{r,j} A_r^j
  l ← log n + if(c_1, l_1, if(c_2, l_2, ...))
  f(I) ← if(c_1, f_1(I), if(c_2, f_2(I), ...))
  return l, f, A

3 Experiments

3.1 Visual concept learning

Humans quickly learn new visual concepts, often from only a few examples [2, 5, 10]. In this section, we present evidence that an unsupervised program synthesis approach can also learn visual concepts from a small number of examples. Our approach is as follows: given a set of example images, we automatically parse them into a symbolic form. Then, we synthesize a program that maximally compresses these parses. Intuitively, this program encodes the common structure needed to draw each of the example images.

We take our visual concepts from the Synthetic Visual Reasoning Test (SVRT), a set of visual classification problems which are easily parsed into distinct shapes. Fig. 1 shows three examples of SVRT concepts. Fig. 2 diagrams the parsing procedure for another visual concept: two arbitrary shapes bordering each other.

We defined a space of simple graphics programs that control a turtle [11] and whose primitives include rotations, forward movement, rescaling of shapes, etc.; see Table 1.
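For intuition about what such turtle programs denote, here is a toy interpreter (our own sketch; the command names follow Table 1, but the encoding of programs and inputs is an illustrative assumption). Its output is an image parse: a list of ⟨id, scale, x, y⟩ shape tuples.

```python
import math

def run(program, inputs):
    """Toy turtle semantics for a fragment of Table 1 (an illustrative sketch)."""
    x, y = 0.0, 0.0
    heading = inputs["theta0"]            # initial orientation is a program input
    shapes = []
    for cmd, *args in program:
        if cmd == "teleport":             # jump to an input position, reset heading
            x, y = inputs[args[0]]
            heading = inputs["theta0"]
        elif cmd == "move":               # rotate by an angle, go forward a length
            heading += args[1]
            x += inputs[args[0]] * math.cos(heading)
            y += inputs[args[0]] * math.sin(heading)
        elif cmd == "draw":               # emit a shape at the current position
            shapes.append((inputs[args[0]], args[1], x, y))
    return shapes

# A program in the style of Fig. 1; the dict holds its latent inputs.
prog = [("teleport", "r0"), ("draw", "s0", 1.0),
        ("move", "l0", 0.0), ("draw", "s0", 0.5)]
print(run(prog, {"r0": (10.0, 15.0), "s0": 1, "l0": 17.0, "theta0": 0.0}))
# -> [(1, 1.0, 10.0, 15.0), (1, 0.5, 27.0, 15.0)]
```

Synthesis inverts this direction: given parses like the output above, the solver searches for a program and inputs that reproduce them.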
Both the learner's observations and the graphics program outputs are image parses, which have three sections: (1) A list of shapes. Each shape is a tuple of a unique ID, a scale from 0 to 1, and x, y coordinates: ⟨id, scale, x, y⟩. (2) A list of containment relations contains(i, j), where i, j range from one to the number of shapes in the parse. (3) A list of reflexive borders relations borders(i, j), where i, j range from one to the number of shapes in the parse.

The algorithm in Section 2.2 describes purely functional programs (programs without state), but the grammar in Table 1 contains imperative commands that modify a turtle's state. We can think of imperative programs as syntactic sugar for purely functional programs that pass around a state variable, as is common in the programming languages literature [7].

The grammar of Table 1 leaves unspecified the number of program inputs. When synthesizing a program from example images, we perform a grid search over the number of inputs. Given images with N shapes and maximum shape ID D, the grid search considers D input shapes, 1 to N input positions, 0 to 2 input lengths and angles, and 0 to 1 input scales. We set the number of imperative draw commands (resp. borders, contains relations) to N (resp. the number of topological relations).

We now define a noise model P_{x|z}(·|·) that specifies how a program output z produces a parse x, by defining a procedure for sampling x given z. First, the x and y coordinates of each shape are perturbed by additive noise drawn uniformly from -δ to δ; in our experiments, we put δ = 3. Then, optional borders and contains relations (see Table 1) are erased with probability 1/2. Last, because the order of the shapes is unidentifiable, both the list of shapes and the indices of the borders/containment relations are randomly permuted.
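Read as a sampler, this noise model can be sketched directly (our own code; a full implementation would also re-index the relations consistently under the permutation):

```python
import random

def corrupt(shapes, relations, delta=3.0):
    """Sample x given z. shapes: list of (id, scale, x, y); relations: list of
    (kind, i, j, optional) with kind in {"contains", "borders"}."""
    noisy = [(sid, scale,
              x + random.uniform(-delta, delta),   # perturb coordinates by +-delta
              y + random.uniform(-delta, delta))
             for (sid, scale, x, y) in shapes]
    kept = [(kind, i, j) for (kind, i, j, optional) in relations
            if not optional or random.random() < 0.5]  # erase optional w.p. 1/2
    random.shuffle(noisy)                 # shape order is unidentifiable
    return noisy, kept
```

With δ = 3 as in the experiments, repeatedly calling `corrupt` on a fixed program output simulates the observed parses.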
The Supplement has the SMT encoding of the noise model and priors over program inputs, which are uniform.

  teleport(position[0], initialOrientation)
  draw(shape[0], scale = 1)
  move(distance[0], 0deg)
  draw(shape[0], scale = scale[0])
  move(distance[0], 0deg)
  draw(shape[0], scale = scale[0])

Figure 1: Left: pairs of examples of three SVRT concepts taken from [5]. Right: the program we synthesize from the leftmost pair. This is a turtle program capable of drawing this pair of pictures and is parameterized by a set of latent variables: shape, distance, scale, initial position, initial orientation.

To encourage translational and rotational invariance, the first turtle command is constrained to always be a teleport to a new location, and the initial orientation of the turtle, which we write as θ_0, is made an input to the synthesized graphics program.

We are introducing an unsupervised learning algorithm, but the SVRT consists of supervised binary classification problems. So we chose to evaluate our visual concept learner by having it solve these classification problems. Given a test image t and a set of examples E_1 (resp. E_2) from class C_1 (resp. C_2), we use the decision rule P(t|E_1) ≷_{C_2}^{C_1} P(t|E_2), or equivalently P_x({t} ∪ E_1) P_x(E_2) ≷_{C_2}^{C_1} P_x(E_1) P_x({t} ∪ E_2). Each term in this decision rule is written as a marginal probability, and we approximate each marginal by lower bounding it by the largest term in its corresponding sum.
This gives

  \underbrace{-l(\{t\} \cup E_1)}_{\approx \log P_x(\{t\} \cup E_1)} \; \underbrace{-\,l(E_2)}_{\approx \log P_x(E_2)} \;\; \gtrless_{C_2}^{C_1} \;\; \underbrace{-l(E_1)}_{\approx \log P_x(E_1)} \; \underbrace{-\,l(\{t\} \cup E_2)}_{\approx \log P_x(\{t\} \cup E_2)}    (4)

  s1 = Shape(id = 1, scale = 1, x = 10, y = 15)
  s2 = Shape(id = 2, scale = 1, x = 27, y = 54)
  borders(s1, s2)

Figure 2: The parser segments shapes and identifies their topological relations (contains, borders), emitting their coordinates, topological relations, and scales.

Grammar rule                    English description
E → (M; D)+; C+; B+             Alternate move/draw; containment relations; borders relations
M → teleport(R, θ_0)            Move turtle to new location R, reset orientation to θ_0
M → move(L, A)                  Rotate by angle A, go forward by distance L
M → flipX() | flipY()           Flip turtle over the X/Y axis
M → jitter()                    Small perturbation to turtle position
D → draw(S, Z)                  Draw shape S at scale Z
Z → 1 | z_1 | z_2 | ···         Scale is either 1 (no rescaling) or a program input z_j
A → 0° | ±90° | θ_1 | θ_2 | ··· Angle is either 0°, ±90°, or a program input θ_j
R → r_1 | r_2 | ···             Positions are program inputs r_j
S → s_1 | s_2 | ···             Shapes are program inputs s_j
L → ℓ_1 | ℓ_2 | ···             Lengths are program inputs ℓ_j
C → contains(Z, Z)              Containment between integer indices into drawn shapes
C → contains?(Z, Z)             Optional containment between integer indices into drawn shapes
B → borders(Z, Z)               Bordering between integer indices into drawn shapes
B → borders?(Z, Z)              Optional bordering between integer indices into drawn shapes

Table 1: Grammar for the vision domain.
The non-terminal E is the start symbol for the grammar. The token ; indicates sequencing of imperative commands. Optional bordering/containment holds with probability 1/2. See the Supplement for denotations of each grammar rule.

where l(·) is

  l(E) \triangleq \min_{f, \{I_e\}_{e \in E}} \left[ -\log P_f(f) - \sum_{e \in E} \left( \log P_I(I_e) + \log P_{x|z}(E_e \mid f(I_e)) \right) \right]    (5)

So, we induce 4 programs that each maximally compress a different set of image parses: E_1, E_2, E_1 ∪ {t}, E_2 ∪ {t}. The maximally compressive program is found by minimizing Eq. 5, putting the observations {x_i} as the image parses, putting the inputs {I_e} as the parameters of the graphics program, and generating the program f(·) by passing the grammar of Table 1 to Algorithm 1.

We evaluated the classification accuracy across each of the 23 SVRT problems by sampling three positive and three negative examples from each class, and then evaluating the accuracy on a held-out test example. 20 such estimates were made for each problem. We compare with three baselines, as shown in Fig. 3. (1) To control for the effect of our parser, we consider how well discriminative classification on the image parses performs. For each image parse, we extracted the following features: number of distinct shapes, number of rescaled shapes, and number of containment/bordering relations, for 4 integer-valued features. Following [5], we used Adaboost with decision stumps on these parse features. (2) We trained two convolutional network architectures for each SVRT problem, and found that a variant of LeNet5 [12] did best; we report those results here. The Supplement has the network parameters and results for both architectures. (3) In [5] several discriminative baselines are introduced. These models are trained on low-level image features; we compare with their best-performing model, which fed 10000 examples to Adaboost with decision stumps.
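The decision rule of Eq. 4 can be played out with a toy stand-in for l(·) (our own illustration; an off-the-shelf compressor replaces the synthesizer-based description length of Eq. 5, and the parses are plain strings):

```python
# Toy illustration of Eq. 4: assign the test item to whichever class's
# examples it compresses best with. zlib is a stand-in for the
# synthesizer-based description length l(.) of Eq. 5.
import zlib

def l(parses):
    """Approximate description length, in bits, of a set of (string) parses."""
    return 8 * len(zlib.compress(" ".join(sorted(parses)).encode()))

def classify(t, E1, E2):
    lhs = -l([t] + E1) - l(E2)      # ~ log Px({t} u E1) + log Px(E2)
    rhs = -l(E1) - l([t] + E2)      # ~ log Px(E1) + log Px({t} u E2)
    return "C1" if lhs >= rhs else "C2"

E1 = ["circle circle", "circle circle circle"]
E2 = ["square triangle", "triangle square square"]
print(classify("circle circle circle circle", E1, E2))
```

The structure mirrors Eq. 4 exactly; only the compressor differs from the paper's method.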
Unsupervised program synthesis does best in terms of average classification accuracy, number of SVRT problems solved at ≥ 90% accuracy,¹ and correlation with the human data.

We do not claim to have solved the SVRT. For example, our representation does not model some geometric transformations needed for some of the concepts, such as rotations of shapes. Additionally, our parsing procedure occasionally makes mistakes, which accounts for the many tasks we solve at accuracies between 90% and 100%.

3.2 Morphological rule learning

How might a language learner discover the rules that inflect verbs? We focus on English inflectional morphology, a system with a long history of computational modeling [13]. Viewed as an unsupervised learning problem, our objective is to find a compressive representation of English verbs.

¹Humans "learn the task" after seven consecutive correct classifications [5]. Seven correct classifications are likely to occur when classification accuracy is ≥ 0.5^{1/7} ≈ 0.9.

Figure 3: Comparing human performance on the SVRT with classification accuracy for machine learning approaches. Human accuracy is the fraction of humans that learned the concept: 0% is chance level. Machine accuracy is the fraction of correctly classified held-out examples: 50% is chance level. The area of each circle is proportional to the number of observations at that point. The dashed line is average accuracy. Program synthesis: this work, trained on 6 examples. ConvNet: a variant of LeNet5 trained on 2000 examples. Parse (Image) features: discriminative learners on features of the parse (pixels) trained on 6 (10000) examples. Humans were given an average of 6.27 examples and solve an average of 19.85 problems [5].

We make the following simplification: our learner is presented with triples of ⟨lexeme, tense, word⟩².
This ignores many of the difficulties involved in language acquisition, but see [14] for an unsupervised approach to extracting similar information from corpora. We can think of these triples as the entries of a matrix whose columns correspond to different tenses and whose rows correspond to different lexemes; see Table 3. We regard each row of this matrix as an observation (the {x_i} of Eq. 1) and identify stems with the inputs to the program we are to synthesize (the {I_i} of Eq. 1). Thus, our objective is to synthesize a program that maps a stem to a tuple of inflections. We put a description length prior over the stem and detail its SMT encoding in the Supplement. We represent words as sequences of phonemes, and define a space of programs that operate upon words, given in Table 2.

English inflectional verb morphology has a set of regular rules that apply for almost all words, as well as a small set of words whose inflections do not follow a regular rule: the "irregular" forms. We roll these irregular forms into the noise model: with some small probability ε, an inflected form is produced not by applying a rule to the stem, but by drawing a sequence of phonemes from a description length prior. In our experiments, we put ε = 0.1. This corresponds to a simple "rules plus lexicon" model of morphology, which is oversimplified in many respects but has been proposed in the past as a crude approximation to the actual system of English morphology [13]. See the Supplement for the SMT encoding of our noise model.

In conclusion, the learning problem is as follows: given triples of ⟨lexeme, tense, word⟩, jointly infer the regular rules, the stems, and which words are irregular exceptions.

We took five inflected forms of the top 5000 lexemes as measured by token frequency in the CELEX lexical inventory [15].
We split this in half to give 2500 lexemes each for training and testing, and trained our model using Random Sample Consensus (RANSAC) [16]. Concretely, we sampled many subsets of the data, each with 4, 5, 6, or 7 lexemes (thus 20, 25, 30, or 35 words), and synthesized the program for each subset by minimizing Eq. 1. We then took the program whose likelihood on the training set was highest. Fig. 4 plots the likelihood on the testing set as a function of the number of subsets (RANSAC iterations) and the size of the subsets (# of lexemes). Fig. 5 shows the program that assigned the highest likelihood to the training data; it also had the highest likelihood on the testing data. With 7 lexemes, the learner consistently recovers the regular linguistic rule; with less data, it recovers rules that are almost as good, degrading gradually as it receives less data.

Most prior work on morphological rule learning falls into two regimes: (1) supervised learning of the phonological form of morphological rules; and (2) unsupervised learning of morphemes from corpora. Because we learn from the lexicon, our model is intermediate in terms of supervision.
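The RANSAC loop just described can be sketched as follows (our own code; `synthesize` is a toy stand-in for the SMT-based minimization of Eq. 1, here merely proposing the commonest suffix in the sampled subset):

```python
import random
from collections import Counter

def synthesize(subset):
    """Toy stand-in for minimizing Eq. 1 on a few lexemes: guess a suffix rule."""
    suffixes = [past[len(stem):] for stem, past in subset if past.startswith(stem)]
    return Counter(suffixes).most_common(1)[0][0] if suffixes else ""

def likelihood(rule, data):
    """Crude score: fraction of training pairs the rule inflects correctly."""
    return sum(stem + rule == past for stem, past in data) / len(data)

def ransac(train, n_iters=100, subset_size=5, seed=0):
    rng = random.Random(seed)
    best_rule, best_score = None, -1.0
    for _ in range(n_iters):
        subset = rng.sample(train, min(subset_size, len(train)))
        rule = synthesize(subset)        # fit on a small random subset
        score = likelihood(rule, train)  # keep it if it explains more of the data
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule

# Mostly-regular data with one irregular ("run" -> "ran") playing the outlier.
train = [("walk", "walked"), ("jump", "jumped"), ("kick", "kicked"),
         ("call", "called"), ("run", "ran")]
print(ransac(train))   # -> "ed"
```

As in the real system, subsets dominated by irregulars yield poor rules that score badly on the full training set and are discarded.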
We compare with representative systems from both regimes as follows:

²The lexeme is the meaning of the stem or root; for example, run, ran, and runs all share the same lexeme.

Grammar rule                    English description
E → ⟨C, ···, C⟩                 Programs are tuples of conditionals, one for each tense
C → R | if (G) R else C         Conditionals have return value R, guard G, else condition C
R → stem + phoneme*             Return values append a suffix to a stem
G → [V P M S]                   Guards condition upon voicing, place, manner, sibilancy
V → V′ | ?                      Voicing specifies a voice V′ or doesn't care
V′ → VOICED | UNVOICED          Voicing options
P → P′ | ?                      Place specifies a place of articulation P′ or doesn't care
P′ → LABIAL | ···               Place of articulation features
M → M′ | ?                      Manner specifies a manner of articulation M′ or doesn't care
M′ → FRICATIVE | ···            Manner of articulation features
S → S′ | ?                      Sibilancy specifies a sibilancy S′ or doesn't care
S′ → SIBILANT | NOTSIBIL        Sibilancy is a binary feature

Table 2: Grammar for the morphology domain. The non-terminal E is the start symbol for the grammar. Each guard G conditions on phonological properties of the end of the stem: voicing, place, manner, and sibilancy. Sequences of phonemes are encoded as tuples of ⟨length, phoneme_1, phoneme_2, ···⟩. See the Supplement for denotations of each grammar rule.

Lexeme       Present      Past          3rd Sing. Pres.   Past Part.    Prog.
style        staIl        staIld        staIlz            staIld        staIlIN
run          r2n          ræn           r2nz              r2n           r2nIN
subscribe    s@bskraIb    s@bskraIbd    s@bskraIbz        s@bskraIbd    s@bskraIbIN
rack         ræk          rækt          ræks              rækt          rækIN

Table 3: Example input to the morphological rule learner.

The Morfessor system [17] induces morphemes from corpora, which it then uses for segmentation. We used Morfessor to segment phonetic forms of the inflections of our 5000 lexemes; compared to the ground-truth inflection transforms provided by CELEX, it has an error rate of 16.43%. Our model segments the same verbs with an error rate of 3.16%. This experiment is best seen as a sanity check: because our system knows a priori to expect only suffixes and knows which words must share the same stem, we expect better performance due to our restricted hypothesis space. To be clear, we are not claiming that we have introduced a stemmer that exceeds or even meets the state of the art.

In [1] Albright and Hayes introduce a supervised morphological rule learner that induces phonological rules from examples of a stem being transformed into its inflected form. Because our model learns a joint distribution over all of the inflected forms of a lexeme, we can use it to predict inflections conditioned upon their present tense. Our model recovers the regular inflections, but does not recover the so-called "islands of reliability" modeled in [1]; e.g., our model predicts that the past tense of the nonce word glee is gleed, but does not predict that a plausible alternative past tense is gled, which the model of Albright and Hayes does. This deficiency is because the space of programs in Table 2 lacks the ability to express this class of rules.

4 Discussion

4.1 Related Work

Inductive programming systems have a long and rich history [4]. Often these systems use stochastic search algorithms, such as genetic programming [18] or MCMC [19].
Others sufficiently constrain the hypothesis space to enable fast exact inference [20]. The inductive logic programming community has had some success inducing Prolog programs using heuristic search [4]. Our work is motivated by the recent successes of systems that put program synthesis in a probabilistic framework [21, 22]. The program synthesis community introduced solver-based methods for learning programs [7, 23, 9], and our work builds upon their techniques.

  PRESENT   = stem
  PAST      = if [CORONAL STOP] stem + Id
              else if [VOICED]  stem + d
              else              stem + t
  PROG.     = stem + IN
  3rd Sing. = if [SIBILANT]     stem + Iz
              else if [VOICED]  stem + z
              else              stem + s

Figure 5: Program synthesized by the morphology learner. The Past Participle program was the same as the past tense program.

Figure 4: Learning curves for our morphology model trained using RANSAC. At each iteration, we sample 4, 5, 6, or 7 lexemes from the training data, fit a model using their inflections, and keep the model if it has higher likelihood on the training data than other models found so far. Each line was run on a different permutation of the samples.

There is a vast literature on computational models of morphology. These include systems that learn the phonological form of morphological rules [1, 13, 24], systems that induce morphemes from corpora [17, 25], and systems that learn the productivity of different rules [26].
In using a general framework, our model is similar in spirit to the early connectionist accounts [24], but our use of symbolic representations is more in line with accounts proposed by linguists, like [1].

Our model of visual concept learning is similar to inverse graphics, but the emphasis upon synthesizing programs is more closely aligned with [2]. We acknowledge that convolutional networks are engineered to solve classification problems qualitatively different from the SVRT, and that one could design better neural network architectures for these problems. For example, it would be interesting to see how the very recent DRAW network [27] performs on the SVRT.

4.2 A limitation of the approach: Large datasets

Synthesizing programs from large datasets is difficult, and complete symbolic solvers often do not degrade gracefully as the problem size increases. Our morphology learner uses RANSAC to sidestep this limitation, but we anticipate domains for which this technique will be insufficient. Prior work in program synthesis introduced Counter-Example Guided Inductive Synthesis (CEGIS) [7] for learning from a large or possibly infinite family of examples, but it cannot accommodate noise in the data. We suspect that a hypothetical RANSAC/CEGIS hybrid would scale to large, noisy training sets.

4.3 Future Work

The two key ideas in this work are (1) the encoding of soft probabilistic constraints as hard constraints for symbolic search, and (2) crafting a domain-specific grammar that serves both to guide the symbolic search and to provide a good inductive bias. Without a strong inductive bias, one cannot possibly generalize from a small number of examples. Yet humans can, and AI systems should, learn over time what constitutes a good prior, hypothesis space, or sketch. Learning a good inductive
Learning a good inductive bias, as done in [22], and then providing that inductive bias to a solver, may be a way of advancing program synthesis as a technology for artificial intelligence.

Acknowledgments

We are grateful for discussions with Timothy O'Donnell on morphological rule learners, for advice from Brendan Lake and Tejas Kulkarni on the convolutional network baselines, and for the suggestions of our anonymous reviewers. This material is based upon work supported by funding from NSF award SHF-1161775, from the Center for Minds, Brains and Machines (CBMM) funded by NSF STC award CCF-1231216, and from ARO MURI contract W911NF-08-1-0242.

References

[1] Adam Albright and Bruce Hayes. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90:119-161, 2003.

[2] Brenden M. Lake, Ruslan R. Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pages 2526-2534, 2013.

[3] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In UAI, pages 220-229, 2008.

[4] Sumit Gulwani, Jose Hernandez-Orallo, Emanuel Kitzelmann, Stephen Muggleton, Ute Schmid, and Ben Zorn. Inductive programming meets the real world. Commun. ACM, 2015.

[5] François Fleuret, Ting Li, Charles Dubout, Emma K. Wampler, Steven Yantis, and Donald Geman. Comparing machines and humans on a visual categorization test. PNAS, 108(43):17621-17625, 2011.

[6] Ray J. Solomonoff. A formal theory of inductive inference. Information and Control, 7(1):1-22, 1964.

[7] Armando Solar-Lezama. Program Synthesis by Sketching. PhD thesis, EECS Department, University of California, Berkeley, December 2008.

[8] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver.
In Tools and Algorithms for the Construction and Analysis of Systems, pages 337-340. Springer, 2008.

[9] Emina Torlak and Rastislav Bodik. Growing solver-aided languages with Rosette. In Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, pages 135-152. ACM, 2013.

[10] Stanislas Dehaene, Véronique Izard, Pierre Pica, and Elizabeth Spelke. Core knowledge of geometry in an Amazonian indigene group. Science, 311(5759):381-384, 2006.

[11] David D. Thornburg. Friends of the turtle. Compute!, March 1983.

[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

[13] Mark S. Seidenberg and David C. Plaut. Quasiregularity and its discontents: the legacy of the past tense debate. Cognitive Science, 38(6):1190-1228, 2014.

[14] Erwin Chan and Constantine Lignos. Investigating the relationship between linguistic representation and computation through an unsupervised model of human morphology learning. Research on Language and Computation, 8(2-3):209-238, 2010.

[15] R. H. Baayen, R. Piepenbrock, and L. Gulikers. CELEX2 LDC96L14. Philadelphia: Linguistic Data Consortium, 1995. Web download.

[16] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, June 1981.

[17] Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report, Aalto University, Helsinki, 2013.

[18] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Complex Adaptive Systems. MIT Press, 1993.

[19] Eric Schkufza, Rahul Sharma, and Alex Aiken.
Stochastic superoptimization. In ACM SIGARCH Computer Architecture News, volume 41, pages 305-316. ACM, 2013.

[20] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, pages 317-330, New York, NY, USA, 2011. ACM.

[21] Yarden Katz, Noah D. Goodman, Kristian Kersting, Charles Kemp, and Joshua B. Tenenbaum. Modeling semantic cognition as logical dimensionality reduction. In CogSci, pages 71-76, 2008.

[22] Percy Liang, Michael I. Jordan, and Dan Klein. Learning programs: A hierarchical Bayesian approach. In Johannes Fürnkranz and Thorsten Joachims, editors, ICML, pages 639-646. Omnipress, 2010.

[23] Sumit Gulwani, Susmit Jha, Ashish Tiwari, and Ramarathnam Venkatesan. Synthesis of loop-free programs. In PLDI, pages 62-73, New York, NY, USA, 2011. ACM.

[24] D. E. Rumelhart and J. L. McClelland. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 216-271. Bradford Books/MIT Press, 1986.

[25] John Goldsmith. Unsupervised learning of the morphology of a natural language. Comput. Linguist., 27(2):153-198, June 2001.

[26] Timothy J. O'Donnell. Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. The MIT Press, 2015.

[27] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.