{"title": "Spectral Methods for Learning Multivariate Latent Tree Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 2025, "page_last": 2033, "abstract": "This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees.  The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables).  We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables.  Our finite sample size bounds for exact recovery of the tree structure  reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution.  Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings.  At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.", "full_text": "Spectral Methods for\n\nLearning Multivariate Latent Tree Structure\n\nAnimashree Anandkumar\n\nUC Irvine\n\nKamalika Chaudhuri\n\nUC San Diego\n\nDaniel Hsu\n\nMicrosoft Research\n\na.anandkumar@uci.edu\n\nkamalika@cs.ucsd.edu\n\ndahsu@microsoft.com\n\nSham M. 
Kakade\n\nMicrosoft Research &\n\nUniversity of Pennsylvania\nskakade@microsoft.com\n\nLe Song\n\nCarnegie Mellon University\n\nTong Zhang\n\nRutgers University\n\nlesong@cs.cmu.edu\n\ntzhang@stat.rutgers.edu\n\nAbstract\n\nThis work considers the problem of learning the structure of multivariate linear\ntree models, which include a variety of directed tree graphical models with contin-\nuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden\nMarkov models, Gaussian mixture models, and Markov evolutionary trees. The\nsetting is one where we only have samples from certain observed variables in the\ntree, and our goal is to estimate the tree structure (i.e., the graph of how the under-\nlying hidden variables are connected to each other and to the observed variables).\nWe propose the Spectral Recursive Grouping algorithm, an ef\ufb01cient and simple\nbottom-up procedure for recovering the tree structure from independent samples\nof the observed variables. Our \ufb01nite sample size bounds for exact recovery of\nthe tree structure reveal certain natural dependencies on underlying statistical and\nstructural properties of the underlying joint distribution. Furthermore, our sample\ncomplexity guarantees have no explicit dependence on the dimensionality of the\nobserved variables, making the algorithm applicable to many high-dimensional\nsettings. At the heart of our algorithm is a spectral quartet test for determining the\nrelative topology of a quartet of variables from second-order statistics.\n\n1\n\nIntroduction\n\nGraphical models are a central tool in modern machine learning applications, as they provide a\nnatural methodology for succinctly representing high-dimensional distributions. 
As such, they have\nenjoyed much success in various AI and machine learning applications such as natural language\nprocessing, speech recognition, robotics, computer vision, and bioinformatics.\nThe main statistical challenges associated with graphical models include estimation and inference.\nWhile the body of techniques for probabilistic inference in graphical models is rather rich [1], current\nmethods for tackling the more challenging problems of parameter and structure estimation are less\ndeveloped and understood, especially in the presence of latent (hidden) variables. The problem of\nparameter estimation involves determining the model parameters from samples of certain observed\nvariables. Here, the predominant approach is the expectation maximization (EM) algorithm, and\nonly rather recently is the understanding of this algorithm improving [2, 3]. The problem of structure\nlearning is to estimate the underlying graph of the graphical model. In general, structure learning is\nNP-hard and becomes even more challenging when some variables are unobserved [4]. The main\napproaches for structure estimation are either greedy or local search approaches [5, 6] or, more\nrecently, based on convex relaxation [7].\n\nFigure 1: The four possible (undirected) tree topologies over leaves {z1, z2, z3, z4}: (a) {{z1, z2},{z3, z4}}; (b) {{z1, z3},{z2, z4}}; (c) {{z1, z4},{z2, z3}}; (d) {{z1, z2, z3, z4}}. Topologies (a)\u2013(c) contain two hidden nodes h and g; topology (d) contains a single hidden node h.\n\nThis work focuses on learning the structure of multivariate latent tree graphical models. Here, the\nunderlying graph is a directed tree (e.g., hidden Markov model, binary evolutionary tree), and only\nsamples from a set of (multivariate) observed variables (the leaves of the tree) are available for\nlearning the structure. 
Latent tree graphical models are relevant in many applications, ranging from\ncomputer vision, where one may learn object/scene structure from the co-occurrences of objects to\naid image understanding [8]; to phylogenetics, where the central task is to reconstruct the tree of life\nfrom the genetic material of surviving species [9].\nGenerally speaking, methods for learning latent tree structure exploit structural properties afforded\nby the tree that are revealed through certain statistical tests over every choice of four variables in the\ntree. These quartet tests, which have origins in structural equation modeling [10, 11], are hypothesis\ntests of the relative con\ufb01guration of four (possibly non-adjacent) nodes/variables in the tree (see\nFigure 1); they are also related to the four point condition associated with a corresponding additive\ntree metric induced by the distribution [12]. Some early methods for learning tree structure are based\non the use of exact correlation statistics or distance measurements (e.g., [13, 14]). Unfortunately,\nthese methods ignore the crucial aspect of estimation error, which ultimately governs their sample\ncomplexity. Indeed, this (lack of) robustness to estimation error has been quanti\ufb01ed for various\nalgorithms (notably, for the popular Neighbor Joining algorithm [15, 16]), and therefore serves as a\nbasis for comparing different methods. Subsequent work in the area of mathematical phylogenetics\nhas focused on the sample complexity of evolutionary tree reconstruction [17, 15, 18, 19]. The basic\nmodel there corresponds to a directed tree over discrete random variables, and much of the recent\neffort deals exclusively in the regime for a certain model parameter (the Kesten-Stigum regime [20])\nthat allows for a sample complexity that is polylogarithmic in the number of leaves, as opposed\nto polynomial [18, 19]. 
Finally, recent work in machine learning has developed structure learning\nmethods for latent tree graphical models that extend beyond the discrete distributions of evolutionary\ntrees [21], thereby widening their applicability to other problem domains.\nThis work extends beyond previous studies, which have focused on latent tree models with either\ndiscrete or scalar Gaussian variables, by directly addressing the multivariate setting where hidden\nand observed nodes may be random vectors rather than scalars. The generality of our techniques\nallows us to handle a much wider class of distributions than before, both in terms of the conditional\nindependence properties imposed by the models (i.e., the random vector associated with a node need\nnot follow a distribution that corresponds to a tree model), as well as other characteristics of the node\ndistributions (e.g., some nodes in the tree could have discrete state spaces and others continuous, as\nin a Gaussian mixture model).\nWe propose the Spectral Recursive Grouping algorithm for learning multivariate latent tree structure.\nThe algorithm has at its core a multivariate spectral quartet test, which extends the classical quar-\ntet tests for scalar variables by applying spectral techniques from multivariate statistics (speci\ufb01cally\ncanonical correlation analysis [22, 23]). Spectral methods have enjoyed recent success in the context\nof parameter estimation [24, 25, 26, 27]; our work shows that they are also useful for structure learn-\ning. We use the spectral quartet test in a simple modi\ufb01cation of the recursive grouping algorithm\nof [21] to perform the tree reconstruction. The algorithm is essentially a robust method for reasoning\nabout the results of quartet tests (viewed simply as hypothesis tests); the tests either con\ufb01rm or reject\nhypotheses about the relative topology over quartets of variables. 
By carefully choosing which tests\nto consider and properly interpreting their results, the algorithm is able to recover the correct latent\ntree structure (with high probability) in a provably ef\ufb01cient manner, in terms of both computational\nand sample complexity. The recursive grouping procedure is similar to the short quartet method\nfrom phylogenetics [15], which also guarantees ef\ufb01cient reconstruction in the context of evolutionary\ntrees. However, our method and analysis apply to considerably more general high-dimensional\nsettings; for instance, our sample complexity bound is given in terms of natural correlation conditions\nthat generalize the more restrictive effective depth conditions of previous works [15, 21].\nFinally, we note that while we do not directly address the question of parameter estimation, provable\nparameter estimation methods may be derived using the spectral techniques from [24, 25].\n\n2 Preliminaries\n\n2.1 Latent variable tree models\nLet T be a connected, directed tree graphical model with leaves Vobs := {x1, x2, . . . , xn} and\ninternal nodes Vhid := {h1, h2, . . . , hm} such that every node has at most one parent. The leaves\nare termed the observed variables and the internal nodes hidden variables. Note that all nodes in\nthis work generally correspond to multivariate random vectors; we will abuse terminology and still\nrefer to these random vectors as random variables. For any h \u2208 Vhid, let ChildrenT(h) \u2286 VT denote\nthe children of h in T.\nEach observed variable x \u2208 Vobs is modeled as a random vector in R^d, and each hidden variable\nh \u2208 Vhid as a random vector in R^k. The joint distribution over all the variables VT := Vobs \u222a\nVhid is assumed to satisfy conditional independence properties speci\ufb01ed by the tree structure over the\nvariables. 
Speci\ufb01cally, for any disjoint subsets V1, V2, V3 \u2286 VT such that V3 separates V1 from V2\nin T, the variables in V1 are conditionally independent of those in V2 given V3.\n\n2.2 Structural and distributional assumptions\n\nThe class of models considered is speci\ufb01ed by the following structural and distributional assumptions.\nCondition 1 (Linear conditional means). Fix any hidden variable h \u2208 Vhid. For each hidden child\ng \u2208 ChildrenT(h) \u2229 Vhid, there exists a matrix A(g|h) \u2208 R^{k\u00d7k} such that\n\nE[g|h] = A(g|h)h;\n\nand for each observed child x \u2208 ChildrenT(h) \u2229 Vobs, there exists a matrix C(x|h) \u2208 R^{d\u00d7k} such\nthat\n\nE[x|h] = C(x|h)h.\n\nWe refer to the class of tree graphical models satisfying Condition 1 as linear tree models. Such\nmodels include a variety of continuous and discrete tree distributions (as well as hybrid combinations\nof the two, such as Gaussian mixture models) which are widely used in practice. Continuous linear\ntree models include linear-Gaussian models and Kalman \ufb01lters. In the discrete case, suppose that\nthe observed variables take on d values, and hidden variables take k values. Then, each variable is\nrepresented by a binary vector in {0, 1}^s, where s = d for the observed variables and s = k for\nthe hidden variables (in particular, if the variable takes value i, then the corresponding vector is the\ni-th coordinate vector), and any conditional distribution between the variables is represented by a\nlinear relationship. Thus, discrete linear tree models include discrete hidden Markov models [25]\nand Markovian evolutionary trees [24].\nIn addition to the linearity, the following conditions are assumed in order to recover the hidden tree\nstructure. For any matrix M, let \u03c3t(M) denote its t-th largest singular value.\nCondition 2 (Rank condition). The variables in VT = Vhid \u222a Vobs obey the following rank conditions.\n\n1. 
For all h \u2208 Vhid, E[hh\u22a4] has rank k (i.e., \u03c3k(E[hh\u22a4]) > 0).\n2. For all h \u2208 Vhid and hidden child g \u2208 ChildrenT(h) \u2229 Vhid, A(g|h) has rank k.\n3. For all h \u2208 Vhid and observed child x \u2208 ChildrenT(h) \u2229 Vobs, C(x|h) has rank k.\n\nThe rank condition is a generalization of parameter identi\ufb01ability conditions in latent variable models\n[28, 24, 25] which rules out various (provably) hard instances in discrete variable settings [24].\n\nFigure 2: Set of trees Fh4 = {T1, T2, T3} obtained if h4 is removed.\n\nCondition 3 (Non-redundancy condition). Each hidden variable has at least three neighbors. Furthermore,\nthere exists \u03c1max^2 > 0 such that for each pair of distinct hidden variables h, g \u2208 Vhid,\n\ndet(E[hg\u22a4])^2 / (det(E[hh\u22a4]) det(E[gg\u22a4])) \u2264 \u03c1max^2 < 1.\n\nThe requirement for each hidden node to have three neighbors is natural; otherwise, the hidden\nnode can be eliminated. The quantity \u03c1max is a natural multivariate generalization of correlation.\nFirst, note that \u03c1max \u2264 1, and that if \u03c1max = 1 is achieved with some h and g, then h and g are\ncompletely correlated, implying the existence of a deterministic map between hidden nodes h and\ng; hence simply merging the two nodes into a single node h (or g) resolves this issue. Therefore\nthe non-redundancy condition simply means that any two hidden nodes h and g cannot be further\nreduced to a single node. Clearly, this condition is necessary for the goal of identifying the correct\ntree structure, and it is satis\ufb01ed as soon as h and g have limited correlation in just a single direction.\nPrevious works [13, 29] show that an analogous condition ensures identi\ufb01ability for general latent\ntree models (and in fact, the conditions are identical in the Gaussian case). 
Condition 3 is therefore\na generalization of this condition suitable for the multivariate setting.\nOur learning guarantees also require a correlation condition that generalizes the explicit depth conditions\nconsidered in the phylogenetics literature [15, 24]. To state this condition, \ufb01rst de\ufb01ne Fh to be\nthe set of subtrees of T that remain after a hidden variable h \u2208 Vhid is removed from T (see Figure 2).\nAlso, for any subtree T\u2032 of T, let Vobs[T\u2032] \u2286 Vobs be the observed variables in T\u2032.\nCondition 4 (Correlation condition). There exists \u03b3min > 0 such that for all hidden variables h \u2208\nVhid and all triples of subtrees {T1, T2, T3} \u2286 Fh in the forest obtained if h is removed from T,\n\nmax_{x1\u2208Vobs[T1], x2\u2208Vobs[T2], x3\u2208Vobs[T3]} min_{{i,j}\u2282{1,2,3}} \u03c3k(E[xi xj\u22a4]) \u2265 \u03b3min.\n\nThe quantity \u03b3min is related to the effective depth of T, which is the maximum graph distance\nbetween a hidden variable and its closest observed variable [15, 21]. The effective depth is at most\nlogarithmic in the number of variables (as achieved by a complete binary tree), though it can also be\na constant if every hidden variable is close to an observed variable (e.g., in a hidden Markov model,\nthe effective depth is 1, even though the true depth, or diameter, is m + 1). If the matrices giving\nthe (conditionally) linear relationship between neighboring variables in T are all well-conditioned,\nthen \u03b3min is at worst exponentially small in the effective depth, and therefore at worst polynomially\nsmall in the number of variables.\nFinally, also de\ufb01ne\n\n\u03b3max := max_{{x1,x2}\u2286Vobs} {\u03c31(E[x1 x2\u22a4])}\n\nto be the largest spectral norm of any second-moment matrix between observed variables. 
Note\n\u03b3max \u2264 1 in the discrete case, and, in the continuous case, \u03b3max \u2264 1 if each observed random\nvector is in isotropic position.\nIn this work, the Euclidean norm of a vector x is denoted by \u2016x\u2016, and the (induced) spectral norm\nof a matrix A is denoted by \u2016A\u2016, i.e., \u2016A\u2016 := \u03c31(A) = sup{\u2016Ax\u2016 : \u2016x\u2016 = 1}.\n\nAlgorithm 1 SpectralQuartetTest on observed variables {z1, z2, z3, z4}.\nInput: For each pair {i, j} \u2282 {1, 2, 3, 4}, an empirical estimate \u02c6\u03a3i,j of the second-moment matrix\nE[zi zj\u22a4] and a corresponding con\ufb01dence parameter \u2206i,j > 0.\nOutput: Either a pairing {{zi, zj},{zi\u2032, zj\u2032}} or \u22a5.\n1: if there exists a partition of {z1, z2, z3, z4} = {zi, zj} \u222a {zi\u2032, zj\u2032} such that\n\n\u220f_{s=1}^{k} [\u03c3s(\u02c6\u03a3i,j) \u2212 \u2206i,j]+ [\u03c3s(\u02c6\u03a3i\u2032,j\u2032) \u2212 \u2206i\u2032,j\u2032]+ > \u220f_{s=1}^{k} (\u03c3s(\u02c6\u03a3i\u2032,j) + \u2206i\u2032,j)(\u03c3s(\u02c6\u03a3i,j\u2032) + \u2206i,j\u2032)\n\nthen return the pairing {{zi, zj},{zi\u2032, zj\u2032}}.\n2: else return \u22a5.\n\n3 Spectral quartet tests\n\nThis section describes the core of our learning algorithm, a spectral quartet test that determines\nthe topology of the subtree induced by four observed variables {z1, z2, z3, z4}. There are four possibilities\nfor the induced subtree, as shown in Figure 1. Our quartet test either returns the correct\ninduced subtree among possibilities in Figure 1(a)\u2013(c); or it outputs \u22a5 to indicate abstention. If the\ntest returns \u22a5, then no guarantees are provided on the induced subtree topology. If it does return a\nsubtree, then the output is guaranteed to be the correct induced subtree (with high probability).\nThe quartet test proposed is described in Algorithm 1 (SpectralQuartetTest). The notation [a]+\ndenotes max{0, a} and [t] (for an integer t) denotes the set {1, 2, . . . 
, t}.\nThe quartet test is de\ufb01ned with respect to four observed variables Z := {z1, z2, z3, z4}. For each\npair of variables zi and zj, it takes as input an empirical estimate \u02c6\u03a3i,j of the second-moment matrix\nE[zi zj\u22a4], and con\ufb01dence bound parameters \u2206i,j which are functions of N, the number of samples\nused to compute the \u02c6\u03a3i,j\u2019s, a con\ufb01dence parameter \u03b4, and of properties of the distributions of zi and\nzj. In practice, one uses a single threshold \u2206 for all pairs, which is tuned by the algorithm. Our\ntheoretical analysis also applies to this case. The output of the test is either \u22a5 or a pairing of the\nvariables {{zi, zj},{zi\u2032, zj\u2032}}. For example, if the output is the pairing {{z1, z2},{z3, z4}}, then\nFigure 1(a) is the output topology.\nEven though the con\ufb01guration in Figure 1(d) is a possibility, the spectral quartet test never returns\n{{z1, z2, z3, z4}}, as there is no correct pairing of Z. The topology {{z1, z2, z3, z4}} can be viewed\nas a degenerate case of {{z1, z2},{z3, z4}} (say) where the hidden variables h and g are deterministically\nidentical, and Condition 3 fails to hold with respect to h and g.\n\n3.1 Properties of the spectral quartet test\n\nWith exact second moments: The spectral quartet test is motivated by the following lemma, which\nshows the relationship between the singular values of second-moment matrices of the zi\u2019s and the\ninduced topology among them in the latent tree. Let detk(M) := \u220f_{s=1}^{k} \u03c3s(M) denote the product\nof the k largest singular values of a matrix M.\nLemma 1 (Perfect quartet test). Suppose that the observed variables Z = {z1, z2, z3, z4} have\nthe true induced tree topology shown in Figure 1(a), and the tree model satis\ufb01es Condition 1 and\nCondition 2. 
Then\n\ndetk(E[z1 z3\u22a4]) detk(E[z2 z4\u22a4]) / (detk(E[z1 z2\u22a4]) detk(E[z3 z4\u22a4])) = detk(E[z1 z4\u22a4]) detk(E[z2 z3\u22a4]) / (detk(E[z1 z2\u22a4]) detk(E[z3 z4\u22a4])) = det(E[hg\u22a4])^2 / (det(E[hh\u22a4]) det(E[gg\u22a4])) \u2264 1   (1)\n\nand detk(E[z1 z3\u22a4]) detk(E[z2 z4\u22a4]) = detk(E[z1 z4\u22a4]) detk(E[z2 z3\u22a4]).\n\nThis lemma shows that given the true second-moment matrices and assuming Condition 3, the inequality\nin (1) becomes strict and thus can be used to deduce the correct topology: the correct pairing\nis {{zi, zj},{zi\u2032, zj\u2032}} if and only if\n\ndetk(E[zi zj\u22a4]) detk(E[zi\u2032 zj\u2032\u22a4]) > detk(E[zi\u2032 zj\u22a4]) detk(E[zi zj\u2032\u22a4]).\n\nReliability: The next lemma shows that even if the singular values of E[zi zj\u22a4] are not known exactly,\nthen with valid con\ufb01dence intervals (that contain these singular values) a robust test can be\nconstructed which is reliable in the following sense: if it does not output \u22a5, then the output topology\nis indeed the correct topology.\nLemma 2 (Reliability). Consider the setup of Lemma 1, and suppose that Figure 1(a) is the\ncorrect topology. If for all pairs {zi, zj} \u2282 Z and all s \u2208 [k], \u03c3s(\u02c6\u03a3i,j) \u2212 \u2206i,j \u2264\n\u03c3s(E[zi zj\u22a4]) \u2264 \u03c3s(\u02c6\u03a3i,j) + \u2206i,j, and if SpectralQuartetTest returns a pairing {{zi, zj},{zi\u2032, zj\u2032}},\nthen {{zi, zj},{zi\u2032, zj\u2032}} = {{z1, z2},{z3, z4}}.\nIn other words, the spectral quartet test never returns an incorrect pairing as long as the singular\nvalues of E[zi zj\u22a4] lie in an interval of length 2\u2206i,j around the singular values of \u02c6\u03a3i,j. The lemma\nbelow shows how to set the \u2206i,js as a function of N, \u03b4 and properties of the distributions of zi and zj\nso that this required event holds with probability at least 1 \u2212 \u03b4. 
We remark that any valid con\ufb01dence\nintervals may be used; the one described below is particularly suitable when the observed variables\nare high-dimensional random vectors.\nLemma 3 (Con\ufb01dence intervals). Let Z = {z1, z2, z3, z4} be four random vectors. Let \u2016zi\u2016 \u2264 Mi\nalmost surely, and let \u03b4 \u2208 (0, 1/6). If each empirical second-moment matrix \u02c6\u03a3i,j is computed using\nN iid copies of zi and zj, and if\n\n\u00afdi,j := (E[\u2016zi\u2016^2 \u2016zj\u2016^2] \u2212 tr(E[zi zj\u22a4] E[zi zj\u22a4]\u22a4)) / max{\u2016E[\u2016zj\u2016^2 zi zi\u22a4]\u2016, \u2016E[\u2016zi\u2016^2 zj zj\u22a4]\u2016},\n\nti,j := 1.55 ln(24 \u00afdi,j/\u03b4),\n\n\u2206i,j \u2265 \u221a(2 max{\u2016E[\u2016zj\u2016^2 zi zi\u22a4]\u2016, \u2016E[\u2016zi\u2016^2 zj zj\u22a4]\u2016} ti,j / N) + Mi Mj ti,j / (3N),\n\nthen with probability 1 \u2212 \u03b4, for all pairs {zi, zj} \u2282 Z and all s \u2208 [k],\n\n\u03c3s(\u02c6\u03a3i,j) \u2212 \u2206i,j \u2264 \u03c3s(E[zi zj\u22a4]) \u2264 \u03c3s(\u02c6\u03a3i,j) + \u2206i,j.   (2)\n\nConditions for returning a correct pairing: The conditions under which SpectralQuartetTest\nreturns an induced topology (as opposed to \u22a5) are now provided.\nAn important quantity in this analysis is the level of non-redundancy between the hidden variables\nh and g. Let\n\n\u03c1^2 := det(E[hg\u22a4])^2 / (det(E[hh\u22a4]) det(E[gg\u22a4])).   (3)\n\nIf Figure 1(a) is the correct induced topology among {z1, z2, z3, z4}, then the smaller \u03c1 is, the\ngreater the gap between detk(E[z1 z2\u22a4]) detk(E[z3 z4\u22a4]) and either of detk(E[z1 z3\u22a4]) detk(E[z2 z4\u22a4])\nand detk(E[z1 z4\u22a4]) detk(E[z2 z3\u22a4]). Therefore, \u03c1 also governs how small the \u2206i,j need to be for the\nquartet test to return a correct pairing; this is quanti\ufb01ed in Lemma 4. Note that Condition 3 implies\n\u03c1 \u2264 \u03c1max < 1.\nLemma 4 (Correct pairing). 
Suppose that (i) the observed variables Z = {z1, z2, z3, z4} have the\ntrue induced tree topology shown in Figure 1(a); (ii) the tree model satis\ufb01es Condition 1, Condition 2,\nand \u03c1 < 1 (where \u03c1 is de\ufb01ned in (3)), and (iii) the con\ufb01dence bounds in (2) hold for all {i, j}\nand all s \u2208 [k]. If\n\n\u2206i,j < (1/(8k)) \u00b7 min{1, 1/\u03c1 \u2212 1} \u00b7 min_{{i,j}} {\u03c3k(E[zi zj\u22a4])}\n\nfor each pair {i, j}, then SpectralQuartetTest returns the correct pairing {{z1, z2},{z3, z4}}.\n\n4 The Spectral Recursive Grouping algorithm\n\nThe Spectral Recursive Grouping algorithm, presented as Algorithm 2, uses the spectral quartet test\ndiscussed in the previous section to estimate the structure of a multivariate latent tree distribution\nfrom iid samples of the observed leaf variables.1 The algorithm is a modi\ufb01cation of the recursive\ngrouping (RG) procedure proposed in [21].\n\n1To simplify notation, we assume that the estimated second-moment matrices \u02c6\u03a3x,y and threshold parameters\n\u2206x,y \u2265 0 for all pairs {x, y} \u2282 Vobs are globally de\ufb01ned. In particular, we assume the spectral quartet\ntests use these quantities.\n\nAlgorithm 2 Spectral Recursive Grouping.\nInput: Empirical second-moment matrices \u02c6\u03a3x,y for all pairs {x, y} \u2282 Vobs computed from N iid\nsamples from the distribution over Vobs; threshold parameters \u2206x,y for all pairs {x, y} \u2282 Vobs.\nOutput: Tree structure \u02c6T or \u201cfailure\u201d.\n1: let R := Vobs, and for all x \u2208 R, T[x] := rooted single-node tree x and L[x] := {x}.\n2: while |R| > 1 do\n3: let pair {u, v} \u2208 {{\u02dcu, \u02dcv} \u2286 R : Mergeable(R, L[\u00b7], \u02dcu, \u02dcv) = true} be such that\nmax{\u03c3k(\u02c6\u03a3x,y) : (x, y) \u2208 L[u] \u00d7 L[v]} is maximized. If no such pair exists, then halt\nand return \u201cfailure\u201d.\n4: let result := Relationship(R, L[\u00b7], T[\u00b7], u, v).\n5: if result = \u201csiblings\u201d then\n6: Create a new variable h, create subtree T[h] rooted at h by joining T[u] and T[v] to h with\nedges {h, u} and {h, v}, and set L[h] := L[u] \u222a L[v].\n7: Add h to R, and remove u and v from R.\n8: else if result = \u201cu is parent of v\u201d then\n9: Modify subtree T[u] by joining T[v] to u with an edge {u, v}, and modify L[u] := L[u] \u222a L[v].\n10: Remove v from R.\n11: else if result = \u201cv is parent of u\u201d then\n12: {Analogous to above case.}\n13: end if\n14: end while\n15: Return \u02c6T := T[h] where R = {h}.\n\nRG builds the tree in a bottom-up fashion, where the\ninitial working set of variables are the observed variables. The variables in the working set always\ncorrespond to roots of disjoint subtrees of T discovered by the algorithm. 
(Note that because these\nsubtrees are rooted, they naturally induce parent/child relationships, but these may differ from those\nimplied by the edge directions in T.) In each iteration, the algorithm determines which variables in\nthe working set to combine. 
If the variables are combined as siblings, then a new hidden variable\nis introduced as their parent and is added to the working set, and its children are removed. If the\nvariables are combined as neighbors (parent/child), then the child is removed from the working set.\nThe process repeats until the entire tree is constructed.\nOur modi\ufb01cation of RG uses the spectral quartet tests from Section 3 to decide which subtree roots\nin the current working set to combine. Note that because the test may return \u22a5 (a null result), our\nalgorithm uses the tests to rule out possible siblings or neighbors among variables in the working\nset; this is encapsulated in the subroutine Mergeable (Algorithm 3), which tests quartets of observed\nvariables (leaves) in the subtrees rooted at working set variables. For any pair {u, v} \u2286 R\nsubmitted to the subroutine (along with the current working set R and leaf sets L[\u00b7]):\n\n\u2022 Mergeable returns false if there is evidence (provided by a quartet test) that u and v should\n\ufb01rst be joined with different variables (u\u2032 and v\u2032, respectively) before joining with each\nother; and\n\n\u2022 Mergeable returns true if no quartet test provides such evidence.\n\nMergeable is also used by the subroutine Relationship (Algorithm 4), which determines whether\na candidate pair of variables should be merged as neighbors (parent/child) or as siblings: essentially,\nto check if u is a parent of v, it checks if v is a sibling of each child of u. The use of unreliable\nestimates of long-range correlations is avoided by only considering highly-correlated variables as\ncandidate pairs to merge (where correlation is measured using observed variables in their corresponding\nsubtrees as proxies). This leads to a sample-ef\ufb01cient algorithm for recovering the hidden\ntree structure.\nThe Spectral Recursive Grouping algorithm enjoys the following guarantee.\nTheorem 1. Let \u03b7 \u2208 (0, 1). 
Assume the directed tree graphical model T over variables (random\nvectors) VT = Vobs \u222a Vhid satis\ufb01es Conditions 1, 2, 3, and 4. Suppose the Spectral Recursive\nGrouping algorithm (Algorithm 2) is provided N independent samples from the distribution over\nVobs, and uses parameters given by\n\n\u2206xi,xj := \u221a(2 Bxi,xj txi,xj / N) + Mxi Mxj txi,xj / (3N)   (4)\n\nwhere\n\nBxi,xj := max{\u2016E[\u2016xi\u2016^2 xj xj\u22a4]\u2016, \u2016E[\u2016xj\u2016^2 xi xi\u22a4]\u2016},\n\nMxi \u2265 \u2016xi\u2016 almost surely,\n\n\u00afdxi,xj := (E[\u2016xi\u2016^2 \u2016xj\u2016^2] \u2212 tr(E[xi xj\u22a4] E[xj xi\u22a4])) / max{\u2016E[\u2016xj\u2016^2 xi xi\u22a4]\u2016, \u2016E[\u2016xi\u2016^2 xj xj\u22a4]\u2016},\n\ntxi,xj := 4 ln(4 \u00afdxi,xj n/\u03b7).\n\nLet B := maxxi,xj\u2208Vobs{Bxi,xj}, M := maxxi\u2208Vobs{Mxi}, t := maxxi,xj\u2208Vobs{txi,xj}. If\n\nN > 200 \u00b7 k^2 \u00b7 B \u00b7 t / ((\u03b3min^2/\u03b3max) \u00b7 (1 \u2212 \u03c1max))^2 + 7 \u00b7 k \u00b7 M^2 \u00b7 t / ((\u03b3min^2/\u03b3max) \u00b7 (1 \u2212 \u03c1max)),\n\nthen with probability at least 1 \u2212 \u03b7, the Spectral Recursive Grouping algorithm returns a tree \u02c6T with\nthe same undirected graph structure as T.\n\nAlgorithm 3 Subroutine Mergeable(R, L[\u00b7], u, v).\nInput: Set of nodes R; leaf sets L[v] for all v \u2208 R; distinct u, v \u2208 R.\nOutput: true or false.\n1: if there exist distinct u\u2032, v\u2032 \u2208 R \\ {u, v} and (x, y, x\u2032, y\u2032) \u2208 L[u] \u00d7 L[v] \u00d7 L[u\u2032] \u00d7 L[v\u2032] s.t.\nSpectralQuartetTest({x, y, x\u2032, y\u2032}) returns {{x, x\u2032},{y, y\u2032}} or {{x, y\u2032},{x\u2032, y}} then return\nfalse.\n2: else return true.\n\nAlgorithm 4 Subroutine Relationship(R, L[\u00b7], T[\u00b7], u, v).\nInput: Set of nodes R; leaf sets L[v] for all v \u2208 R; rooted subtrees T[v] for all v \u2208 R; distinct\nu, v \u2208 R.\nOutput: \u201csiblings\u201d, \u201cu is parent of v\u201d (\u201cu \u2192 v\u201d), or \u201cv is parent of u\u201d (\u201cv \u2192 u\u201d).\n1: if u is a leaf then assert \u201cu \u219b v\u201d.\n2: if v is a leaf then assert \u201cv \u219b u\u201d.\n3: let R[w] := (R \\ {w}) \u222a {w\u2032 : w\u2032 is a child of w in T[w]} for each w \u2208 {u, v}.\n4: if there exists child u1 of u in T[u] s.t. Mergeable(R[u], L[\u00b7], u1, v) = false then assert \u201cu \u219b v\u201d.\n5: if there exists child v1 of v in T[v] s.t. Mergeable(R[v], L[\u00b7], u, v1) = false then assert \u201cv \u219b u\u201d.\n6: if both \u201cu \u219b v\u201d and \u201cv \u219b u\u201d were asserted then return \u201csiblings\u201d.\n7: else if \u201cu \u219b v\u201d was asserted then return \u201cv is parent of u\u201d (\u201cv \u2192 u\u201d).\n8: else return \u201cu is parent of v\u201d (\u201cu \u2192 v\u201d).\n\nConsistency is implied by the above theorem with an appropriate scaling of \u03b7 with N. 
The theorem\nreveals that the sample complexity of the algorithm depends solely on intrinsic spectral properties\nof the distribution. 
Note that there is no explicit dependence on the dimensions of the observable\nvariables, which makes the result applicable to high-dimensional settings.\n\nAcknowledgements\nPart of this work was completed while DH was at the Wharton School of the University of Pennsylvania\nand at Rutgers University. AA was supported in part by the setup funds at UCI and the\nAFOSR Award FA9550-10-1-0310.\n\nReferences\n[1] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\n\n[2] S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203\u2013226, 2007.\n\n[3] K. Chaudhuri, S. Dasgupta, and A. Vattani. Learning mixtures of Gaussians using the k-means algorithm, 2009. arXiv:0912.0086.\n\n[4] D. M. Chickering, D. Heckerman, and C. Meek. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5:1287\u20131330, 2004.\n\n[5] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462\u2013467, 1968.\n\n[6] N. Friedman, I. Nachman, and D. Pe'er. Learning Bayesian network structure from massive datasets: the \u201csparse candidate\u201d algorithm. In Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.\n\n[7] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using \u21131-regularized logistic regression. Annals of Statistics, 38(3):1287\u20131319, 2010.\n\n[8] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.\n\n[9] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. 
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999.\n\n[10] J. Wishart. Sampling errors in the theory of two factors. British Journal of Psychology, 19:180\u2013187, 1928.\n\n[11] K. Bollen. Structural Equation Models with Latent Variables. John Wiley & Sons, 1989.\n\n[12] P. Buneman. The recovery of trees from measurements of dissimilarity. In F. R. Hodson, D. G. Kendall, and P. Tautu, editors, Mathematics in the Archaeological and Historical Sciences, pages 387\u2013395. 1971.\n\n[13] J. Pearl and M. Tarsi. Structuring causal trees. Journal of Complexity, 2(1):60\u201377, 1986.\n\n[14] N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406\u2013425, 1987.\n\n[15] P. L. Erd\u0151s, L. A. Sz\u00e9kely, M. A. Steel, and T. J. Warnow. A few logs suffice to build (almost) all trees: Part II. Theoretical Computer Science, 221:77\u2013118, 1999.\n\n[16] M. R. Lacey and J. T. Chang. A signal-to-noise analysis of phylogeny estimation by neighbor-joining: insufficiency of polynomial length sequences. Mathematical Biosciences, 199(2):188\u2013215, 2006.\n\n[17] P. L. Erd\u0151s, L. A. Sz\u00e9kely, M. A. Steel, and T. J. Warnow. A few logs suffice to build (almost) all trees (I). Random Structures and Algorithms, 14:153\u2013184, 1999.\n\n[18] E. Mossel. Phase transitions in phylogeny. Transactions of the American Mathematical Society, 356(6):2379\u20132404, 2004.\n\n[19] C. Daskalakis, E. Mossel, and S. Roch. Evolutionary trees and the Ising model on the Bethe lattice: A proof of Steel\u2019s conjecture. Probability Theory and Related Fields, 149(1\u20132):149\u2013189, 2011.\n\n[20] H. Kesten and B. P. Stigum. Additional limit theorems for indecomposable multidimensional Galton-Watson processes. Annals of Mathematical Statistics, 37:1463\u20131481, 1966.\n\n[21] M. J. Choi, V. 
Tan, A. Anandkumar, and A. Willsky. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1771\u20131812, 2011.\n\n[22] M. S. Bartlett. Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society, 34:33\u201340, 1938.\n\n[23] R. J. Muirhead and C. M. Waternaux. Asymptotic distributions in canonical correlation analysis and other multivariate procedures for nonnormal populations. Biometrika, 67(1):31\u201343, 1980.\n\n[24] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583\u2013614, 2006.\n\n[25] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Twenty-Second Annual Conference on Learning Theory, 2009.\n\n[26] S. M. Siddiqi, B. Boots, and G. J. Gordon. Reduced-rank hidden Markov models. In Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.\n\n[27] L. Song, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning, 2010.\n\n[28] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099\u20133132, 2009.\n\n[29] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\n\n[30] D. Hsu, S. M. Kakade, and T. Zhang. 
Dimension-free tail inequalities for sums of random matrices, 2011. arXiv:1104.1672.\n", "award": [], "sourceid": 1143, "authors": [{"given_name": "Animashree", "family_name": "Anandkumar", "institution": null}, {"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": null}, {"given_name": "Daniel", "family_name": "Hsu", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Le", "family_name": "Song", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}]}