{"title": "Automatic Derivation of Statistical Algorithms: The EM Family and Beyond", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 696, "abstract": null, "full_text": "Automatic Derivation of Statistical Algorithms:\nThe EM Family and Beyond\n\nAlexander G. Gray\nCarnegie Mellon University\nagray@cs.cmu.edu\n\nBernd Fischer and Johann Schumann\nRIACS / NASA Ames\n{fisch,schumann}@email.arc.nasa.gov\n\nWray Buntine\nHelsinki Institute for IT\nbuntine@hiit.fi\n\nAbstract\n\nMachine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTOBAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA.\n\n1 Automatic Derivation of Statistical Algorithms\n\nOverview. 
We describe a symbolic program synthesis system which works as a \u201cstatistical algorithm compiler\u201d: it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTOBAYES, can be loosely thought of as \u201cpart theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.\u201d It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTOBAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way.\n\nThe algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data, and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards (see footnote 1).\n\nThe challenges of algorithm design. 
The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas.\n\n2 Combining Schema-based Synthesis and Bayesian Networks\n\nStatistical Models. Externally, AUTOBAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The figure shows the specification of the mixture of Gaussians example used throughout this paper (see footnote 2). Note the constraint that the sum of the class probabilities must equal one (line 8) along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr(x | \u03c6, \u03bc, \u03c3) with respect to the parameters \u03c6, \u03bc, and \u03c3. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.\n\n 1 model mog as 'Mixture of Gaussians';\n 2 const int n_points as 'nr. of data points'\n 3   with 0 < n_points;\n 4 const int n_classes := 3 as 'nr. classes'\n 5   with 0 < n_classes\n 6   with n_classes << n_points;\n 7 double phi(1..n_classes) as 'weights'\n 8   with 1 = sum(I := 1..n_classes, phi(I));\n 9 double mu(1..n_classes);\n 9 double sigma(1..n_classes);\n10 int c(1..n_points) as 'class labels';\n11 c ~ disc(vec(I := 1..n_classes, phi(I)));\n12 data double x(1..n_points) as 'data';\n13 x(I) ~ gauss(mu(c(I)), sigma(c(I)));\n14 max pr(x | {phi,mu,sigma}) wrt {phi,mu,sigma};\n\nComputational logic and theorem proving. Internally, AUTOBAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTOBAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal.\n\nFootnote 1: Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model).\n\nFootnote 2: Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code\u2019s comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d.\n\nComputer algebra. 
The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTOBAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task (see footnote 3).\n\nSchema-based synthesis. The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM.\n\nFirst-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. 
Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen\u2019s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node:\n\nLemma 1. Let U, V be sets of variables over a Bayesian network with U \u2229 V = \u2205. Then V \u2229 descendents(U) = \u2205 and parents(U) \u2286 V hold (see footnote 4) in the corresponding dependency graph iff the following probability statement holds:\n\nPr(U | V) = Pr(U | parents(U)) = \u220f_{u \u2208 U} Pr(u | parents(u))\n\n[Figure: initial Bayesian network for the running example, with nodes \u03c6, \u03bc, \u03c3, c (discrete), and x (gauss), and plates Nclasses and Npoints.]\n\nSymbolic probabilistic inference. How can probabilities not satisfying these conditions be converted to symbolic expressions? 
While many general schemes for inference on networks exist, our principal hurdle is the need to perform this over symbolic expressions incorporating real and integer variables from disparate real or infinite-discrete distributions. For instance, we might wish to compute the full maximum a posteriori probability for the mean and variance vectors of a Gaussian mixture model under a Bayesian framework. While the sum-product framework of [8] is perhaps closer to our formulation, we have out of necessity developed another scheme that lets us extract probabilities on a large class of mixed discrete and real, potentially indexed variables, where no integrals are needed and all marginalization is done by summing out discrete variables. We give the non-indexed case below; this is readily extended to indexed variables (i.e., vectors).\n\nLemma 2. [...] is independent of [...] given [...] iff there exists a set of variables [...] such that Lemma 1 holds if we replace [...]. Moreover, the unique minimal set [...] satisfying these conditions is given by [...].\n\nLemma 3. Let [...] be a subset of [...] such that [...]. Then Lemma 2 holds if we replace [...]. Moreover, there is a unique maximal set [...] satisfying these conditions.\n\nLemma 2 lets us evaluate a probability by a summation:\n\n[...]\n\nwhile Lemma 3 lets us evaluate a probability by a summation and a ratio:\n\n[...]\n\nSince the lemmas also show minimality of the sets [...] and [...], they also give the minimal conditions under which a probability can be evaluated by discrete summation without integration. These inference lemmas are operationalized as network decomposition schemas. However, we usually attempt to decompose a probability into independent components before applying this schema.\n\nFootnote 3: Popular symbolic packages such as Mathematica contain known errors allowing unsound derivations; they also lack the support for reasoning with vector and matrix quantities.\n\nFootnote 4: Note that U \u2229 descendents(U) = \u2205 and U \u2229 parents(U) = \u2205.\n\n3 The AUTOBAYES System \u2014 Implementation Outline\n\nLevels of representation. Internally, our system uses three conceptually different levels of representation. Probabilities (including logarithmic and conditional probabilities) are the most abstract level. They are processed via methods for Bayesian network decomposition or match with core algorithms such as EM. Formulae are introduced when probabilities of the form Pr(U | parents(U)) are detected, either in the initial network, or after the application of network decompositions. Atomic probabilities (i.e., U is a single variable) are directly replaced by formulae based on the given distribution and its parameters. General probabilities are decomposed into sums and products of the respective atomic probabilities. 
Formulae are ready for immediate optimization using symbolic or numeric methods but sometimes they can be decomposed further into independent subproblems. Finally, we use imperative intermediate code as the lowest level to represent both program fragments within the schemas as well as the completely constructed programs. All transformations we apply operate on or between these levels.\n\nTransformations for optimization. A number of different kinds of transformations are available. Decomposition of a problem into independent subproblems is always done. Decomposition of probabilities is driven by the Bayesian network; we have a separate system for handling decomposition of formulae. A formula can be decomposed along a loop: an optimization problem whose objective is built from per-index terms is transformed into a for-loop over the corresponding subproblems. More commonly, an optimization problem whose objective splits into two parts over separate variables is transformed into two subprograms, one for each part. The lemmas given earlier are applied to change the level of representation and are thus used for simplification of probabilities. Examples of general expression simplification include simplifying the log of a formula, moving a summation inwards, and so on. When necessary, symbolic differentiation is performed. In the initial specification or in intermediate representations, 
likelihoods (i.e., subexpressions of the form [...]) are identified and simplified into linear expressions with terms such as mean([...]) and mean([...]). The statistical algorithm schemas currently implemented include EM, k-means, and discrete model selection. Adding a Gibbs sampling schema would yield functionality comparable to that of BUGS [14]. Usually, the schemas require a particular form of the probabilities involved; they are thus tightly coupled to the decomposition and simplification transformations. For example, EM is a way of dealing with situations where Lemma 2 applies but where [...] is indexed identically to the data.\n\nCode and test generation. From the intermediate code, code in a particular target language may be generated. Currently, AUTOBAYES can generate C++ and C which can be used in a stand-alone fashion or linked into Octave or Matlab (as a mex file). During this code-generation phase, most of the vector and matrix expressions are converted into for-loops, and various code optimizations are performed which are impossible for a standard compiler. Our tool does not only generate efficient code, but also highly readable, documented programs: model- and algorithm-specific comments are generated automatically during the synthesis phase. For most examples, roughly 30% of the produced lines are comments. These comments provide explanation of the algorithm\u2019s derivation. A generated HTML software design document with navigation capabilities facilitates code understanding and reading. AUTOBAYES also automatically generates a program for sampling from the specified model, so that closed-loop testing with synthetic data of the assumed distributions can be done. This can be done using simple forward sampling.\n\n4 Example: Deriving the EM Algorithm for Gaussian Mixtures\n\n1. User specifies model. First, the user specifies the model as shown in Section 2.\n2. 
System parses model to obtain underlying Bayes net. From the model, the underlying Bayesian network is derived and represented internally as a directed graph. For visualization, AUTOBAYES can also produce a graph drawing as shown in Section 2.\n3. System observes hidden-variable structure in Bayesian network. The system attempts to decompose the optimization goal into independent parts, but finds that it cannot. However, it then finds that the probability in the initial optimization statement matches the conditions of Lemma 2 and that the network describes a latent variable model.\n4. System invokes abstract EM-family schema. This triggers the EM-schema, whose overall structure is shown. The syntactic structure of the current subproblem must match the first argument of the schema; if additional applicability constraints (not shown here) hold, this schema is executed. It constructs a piece of code which is returned in the variable C. This code fragment can contain recursive calls to other schemas (denoted by [...]) which return code for subproblems which then is inserted into the schema, such as converging, a generic convergence criterion here imposed over the variables \u03c6, \u03bc, \u03c3. Note that the schema actually implements an ME-algorithm (i.e., starts the loop with the M-step) because the initialization already serves as an E-step. The system identifies the discrete variable c as the single hidden variable, i.e., [...]. For representation of the distribution of the hidden variable a matrix q is generated, where q(i, j) is the probability that the i-th point falls into the j-th class. AUTOBAYES then constructs the new distribution c(I) ~ disc(vec(J := 1..n_classes, q(I, J))) which replaces the original distribution in the following recursive calls of AUTOBAYES.\n\n[Figure: overall structure of the EM schema. The schema head matches a goal of the form max Pr(...) wrt ...; the returned code C consists of an initialization followed by a while-loop over a convergence test whose body is an M-step (a recursive max Pr(...) wrt ... subproblem) followed by an E-step (calculate Pr(...)).]\n\n5. E-step: System performs marginalization. The freshly introduced distribution for c implies that c can be eliminated from the objective function by summing over [...]. This gives us the partial program shown in the internal pseudocode.\n\n[Figure: partial program in internal pseudocode: a while-loop over the convergence test containing nested for-loops and the remaining max Pr(...) wrt ... subproblem.]\n\n6. M-step: System recursively decomposes optimization problem. 
AUTOBAYES is recursively called with the new goal max Pr(x, c | \u03c6, \u03bc, \u03c3) wrt {\u03c6, \u03bc, \u03c3}. Now, the Bayesian network decomposition schema applies with [...], revealing that [...] is independent of [...], thus the optimization problem can be decomposed into two optimization subproblems: max Pr(x | c, \u03bc, \u03c3) wrt {\u03bc, \u03c3} and max Pr(c | \u03c6) wrt {\u03c6}.\n\n[Figure: program in internal pseudocode after the decomposition: a while-loop over the convergence test containing for-loops and max Pr(...) wrt ... subproblems.]\n\n7. System unrolls i.i.d. vectors. The first subgoal from the decomposition schema, max Pr(x | c, \u03bc, \u03c3) wrt {\u03bc, \u03c3}, can be unrolled over the independent and identically distributed vector x using an index decomposition schema which moves expressions out of loops (sums or products) when they are not dependent on the loop index. Since c and x are co-indexed, unrolling proceeds over both (also independent and identically distributed) vectors in parallel: max \u220f_i Pr(x_i | c_i, \u03bc, \u03c3) wrt {\u03bc, \u03c3}.\n8. System identifies and solves Gaussian elimination problem. 
The probability Pr(x_i | c_i, \u03bc, \u03c3) is atomic because parents(x_i) = {c_i, \u03bc, \u03c3}. It can thus be replaced by the appropriately instantiated Gaussian density function. Because the strictly monotone log function can first be applied to the objective function of the maximization, it becomes max [...] wrt {\u03bc, \u03c3}. Another application of index decomposition allows solution for the two scalars \u03bc_j and \u03c3_j. Gaussian elimination is then used to solve this subproblem analytically, yielding the sequence of expressions [...].\n9. System identifies and solves Lagrange multiplier problem. The second subgoal max Pr(c | \u03c6) wrt {\u03c6} can be unrolled over the i.i.d. vector c as before. The specification condition 1 = \u2211_j \u03c6_j creates a constrained maximization problem in the vector \u03c6 which is solved by an application of the Lagrange multiplier schema. This in turn results in two subproblems for a single instance \u03c6_j and for the multiplier which are both solved symbolically. Thus, the usual EM algorithm for Gaussian mixtures is derived.\n10. System checks and optimizes pseudocode. During the synthesis process, AUTOBAYES accumulates a number of constraints which have to hold to ensure proper operation of the code (e.g., absence of divide-by-zero errors). Unless these constraints can be resolved against the model (e.g., [...]), AUTOBAYES automatically inserts run-time checks into the code. Before finally generating C/C++ code, the pseudocode is optimized using information from the specification (e.g., [...]) and the domain. Thus, optimizations well beyond the capability of a regular compiler can be done.\n11. 
System translates pseudocode to real code in desired language. Finally, AUTOBAYES converts the intermediate code into code of the desired target system. The source code contains thorough comments detailing the mathematics implemented. A regular compiler containing generic performance optimizations not repeated by AUTOBAYES turns the code into an executable program. A program for sampling from a mixture of Gaussians is also produced for testing purposes.\n\n5 Range of Capabilities\n\nHere, we discuss 18 examples which have been successfully handled by AUTOBAYES, ranging from simple 
textbook examples to sophisticated EM models and recent multinomial versions of PCA. For each entry, the table below gives a brief description, the number of lines of the specification and synthesized C++ code (loc), and the runtime to generate the code (in secs., measured on a 2.2GHz Linux system). Correctness was checked for these examples using automatically-generated test data and hand-written implementations.\n\nBayesian textbook examples. Simple textbook examples, like Gaussian with simple prior, Gaussian with inverse gamma prior, or Gaussian with conjugate prior have closed-form solutions. The symbolic system of AUTOBAYES can actually find these solutions and thus generate short and efficient code. However, a slight relaxation of the prior on \u03bc (Gaussian with semi-conjugate prior) requires an iterative numerical solver.\n\nGaussians in action. One example is a Gaussian change-detection model. A slight extension of our running example, integrating several features, yields a Gaussian Bayes classifier model, which has been successfully tested on various standard benchmarks [1], e.g., the Abalone dataset. Currently, the number of expected classes has to be given in advance.\n\nMixture models and EM. A wide range of k-Gaussian mixture models can be handled by AUTOBAYES, ranging from the simple 1D and 2D with diagonal covariance to 1D models for multi-dimensional classes and with (conjugate) priors on mean or variance. Using only a slight variation in the specification, the Gaussian distribution can be replaced by other distributions (e.g., exponentials, for failure analysis) or combinations (e.g., k-Gaussian and Beta, or k-Cauchy and Poisson). 
In the algorithm generated for the Gaussian and Beta combination, the analytic subsolution for the Gaussian case is combined with the numerical solver. Finally, the last mixture example is a k1-Gaussians and k2-Gaussians two-level hierarchical mixture model which is solved by a nested instantiation of EM [15]: i.e., the M-step of the outer EM algorithm is a second EM algorithm nested inside.\n\nMixtures for Regression. We represented regression with Gaussian error and Legendre polynomials with full conjugate priors allowing smoothing [10]. Two versions of this were then done: robust linear regression replaces the Gaussian error with a mixture of two Gaussians (one broad, one peaked) both centered at zero. Trajectory clustering replaces the single regression curve by a mixture of several curves [5]. In both cases an EM algorithm is correctly integrated with the exact regression solutions.\n\nPrincipal Component Analysis. We also tested a multinomial version of PCA called latent Dirichlet allocation [2]. AUTOBAYES currently lacks variational support, yet it manages to combine a k-means style outer loop on the component proportions with an EM-style inner loop on the hidden counts, producing the original algorithm of Hofmann, Lee and Seung, and others [4].\n\nTable: examples handled by AUTOBAYES (loc: lines of specification / lines of synthesized C++ code; time: synthesis time in seconds).\n\nDescription | loc | time\n[...] (Gaussian textbook example) | 12/137 | 0.2\n[...] (Gaussian textbook example) | 16/188 | 0.4\nGauss step-detect | 19/662 | 2.0\nk-Gauss mix 1D | 17/418 | 0.7\nk-Gauss mix, multi-dim | 24/900 | 1.1\nk-Gauss mix, \u03c3 prior | 21/442 | 0.9\nk1, k2-Gauss hierarch | 22/834 | 1.7\nGauss/Beta mix | 29/1053 | 2.3\nrob. lin. regression | 54/1877 | 14.5\nmixture regression | 53/1282 | 9.8\n[...] (Gaussian textbook example) | 13/148 | 0.2\n[...] (Gaussian textbook example) | 17/233 | 0.4\nGauss Bayes Classify | 58/1598 | 4.7\nk-Gauss mix 2D, diag | 22/599 | 1.2\nk-Gauss mix 1D, \u03bc prior | 25/456 | 1.0\nk-Exp mix | 15/347 | 0.5\nk-Cauchy/Poisson mix | 21/747 | 1.0\nPCA mult/w k-means | 26/390 | 1.2\n\n6 Conclusion\n\nBeyond existing systems. Code libraries are common in statistics and learning, but they lack the high level of automation achievable only by deep symbolic reasoning. The Bayes Net Toolbox [12] is a Matlab library which allows users to program in models but does not derive algorithms or generate code. The BUGS system [14] also allows users to program in models but is specialized for Gibbs sampling. Stochastic parametrized grammars [11] allow a concise model specification similar to AUTOBAYES\u2019s specification language, but are currently only a notational device similar to XML.\n\nBenefits of automated algorithm and code generation. Industrial-strength code. Code generated by AUTOBAYES is efficient, validated, and commented. Extreme applications. Extremely complex or critical applications such as spacecraft challenge the reliability limits of human-developed software. Automatically generated software allows for pervasive condition checking and correctness-by-construction. Fast prototyping and experimentation. 
For both the data analyst and machine learning researcher, AUTOBAYES can function\nas a powerful experimental workbench. New complex algorithms. Even with only the few\nelements implemented so far, we showed that algorithms approaching research-level results\n[4, 5, 10, 15] can be automatically derived. As more distributions, optimization methods\nand generalized learning algorithms are added to the system, an exponentially growing\nnumber of complex new algorithms becomes possible, including non-trivial variants which\nmay challenge any single researcher\u2019s particular algorithm design expertise.\nFuture agenda. The ultimate goal is to give researchers the ability to experiment with the\nentire space of complex models and state-of-the-art statistical algorithms, and to allow new\nalgorithmic ideas, as they appear, to be implicitly generalized to every model and special\ncase known to be applicable. We have already begun work on generalizing the EM schema\nto continuous hidden variables, as well as adding schemas for variational methods, fast\nkd-tree and N-body algorithms, MCMC, and temporal models.\nAvailability. A web interface for AUTOBAYES is currently under development. More\ninformation is available at http://ase.arc.nasa.gov/autobayes.\n\nReferences\n\n[1] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.\n\n[2] D. Blei, A.Y. Ng, and M. Jordan. Latent Dirichlet allocation. NIPS*14, 2002.\n\n[3] W.L. Buntine. Operations for learning with graphical models. JAIR, 2:159\u2013225, 1994.\n\n[4] W.L. Buntine. Variational extensions to EM and multinomial PCA. ECML 2002, pp. 23\u201334, 2002.\n\n[5] G.S. Gaffney and P. Smyth. Trajectory clustering using mixtures of regression models. In 5th KDD, pp. 63\u201372, 1999.\n\n[6] Z. Ghahramani and M.J. Beal. Propagation algorithms for variational Bayesian learning. In NIPS*12, pp. 507\u2013513, 2000.\n\n[7] P. Haddawy. Generating Bayesian Networks from Probability Logic Knowledge Bases. 
In UAI 10, pp. 262\u2013269, 1994.\n\n[8] F.R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory, 47(2):498\u2013519, 2001.\n\n[9] S.L. Lauritzen, A.P. Dawid, B.N. Larsen, and H.-G. Leimer. Independence properties of directed Markov fields. Networks, 20:491\u2013505, 1990.\n\n[10] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415\u2013447, 1991.\n\n[11] E. Mjolsness and M. Turmon. Stochastic parameterized grammars for Bayesian model composition. In NIPS*2000 Workshop on Software Support for Bayesian Analysis Systems, Breckenridge, December 2000.\n\n[12] K. Murphy. Bayes Net Toolbox for Matlab. Interface of Computing Science and Statistics 33, 2001.\n\n[13] P. Smyth, D. Heckerman, and M. Jordan. Probabilistic independence networks for hidden Markov models. Neural Computation, 9(2):227\u2013269, 1997.\n\n[14] A. Thomas, D.J. Spiegelhalter, and W.R. Gilks. BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4, pp. 837\u2013842, 1992.\n\n[15] D.A. van Dyk. The nested EM algorithm. Statistica Sinica, 10:203\u2013225, 2000.\n", "award": [], "sourceid": 2176, "authors": [{"given_name": "Bernd", "family_name": "Fischer", "institution": null}, {"given_name": "Johann", "family_name": "Schumann", "institution": null}, {"given_name": "Wray", "family_name": "Buntine", "institution": null}, {"given_name": "Alexander", "family_name": "Gray", "institution": null}]}