{"title": "Fast Computation of Posterior Mode in Multi-Level Hierarchical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1913, "page_last": 1920, "abstract": "Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable organized in a hierarchy. Model fitting is challenging, especially for hierarchies with large number of nodes. We provide a novel algorithm based on a multi-scale Kalman filter that is both scalable and easy to implement. For non-Gaussian responses, quadratic approximation to the log-likelihood results in biased estimates. We suggest a bootstrap strategy to correct such biases. Our method is illustrated through simulation studies and analyses of real world data sets in health care and online advertising.", "full_text": "Fast Computation of Posterior Mode in Multi-Level\n\nHierarchical Models\n\nLiang Zhang\n\nDepartment of Statistical Science\n\nDuke University\n\nDurham, NC 27708\n\nDeepak Agarwal\nYahoo! Research\n\n2821 Mission College Blvd.\n\nSanta Clara, CA 95054\n\nlz9@stat.duke.edu\n\ndagarwal@yahoo-inc.com\n\nAbstract\n\nMulti-level hierarchical models provide an attractive framework for incorporating\ncorrelations induced in a response variable that is organized hierarchically. Model\n\ufb01tting is challenging, especially for a hierarchy with a large number of nodes. We\nprovide a novel algorithm based on a multi-scale Kalman \ufb01lter that is both scalable\nand easy to implement. For Gaussian response, we show our method provides the\nmaximum a-posteriori (MAP) parameter estimates; for non-Gaussian response,\nparameter estimation is performed through a Laplace approximation. However,\nthe Laplace approximation provides biased parameter estimates that is corrected\nthrough a parametric bootstrap procedure. 
We illustrate through simulation studies and analyses of real world data sets in health care and online advertising.

1 Introduction

In many real-world prediction problems, the response variable of interest is clustered hierarchically. For instance, in studying the immunization status of a set of children in a particular geographic location, the children are naturally clustered by families, which in turn are clustered into communities. The clustering often induces correlations in the response variable; models that exploit this provide significant improvement in predictive performance. Multi-level hierarchical models provide an attractive framework for modeling such correlations. Although routinely applied to moderate sized data (a few thousand nodes) in several fields like epidemiology, social sciences and biology [3], model fitting is computationally expensive and is usually performed through a Cholesky decomposition of a q (number of nodes in the hierarchy) dimensional matrix. Recently, such models have shown promise in a novel application of internet advertising [1], where the goal is to select the top-k advertisements to be shown on a webpage to maximize the click-through rates. To capture the semantic meaning of content in a parsimonious way, it is commonplace to classify webpages and ads into large pre-defined hierarchies. The hierarchy in such applications consists of several levels, and the total number of nodes may run into millions. Moreover, the main goal is to exploit the hierarchy for obtaining better predictions; computing the full posterior predictive distribution is of secondary importance. Existing fitting algorithms are difficult to implement and do not scale well for such problems. In this paper, we provide a novel, fast and easy to implement algorithm to compute the posterior mode of parameters for such models on datasets organized hierarchically into millions of nodes with several levels.
The key component of our algorithm is a multi-scale Kalman filter that expedites the computation of an otherwise expensive conditional posterior.

The central idea in multi-level hierarchical (MLH hereafter) models is "shrinkage" across the nodes in the hierarchy. More specifically, these models assume a multi-level prior wherein the parameters of children nodes are assumed to be drawn from a distribution centered around the parameter of the parent. This bottom-up, recursive assumption provides a posterior whose estimates at the finest resolution are smoothed using data on the lineage path of the node in the hierarchy.

Notation       Meaning
T_j            Level j of the hierarchy T
m_j            The number of nodes at level j in T
q              The total number of nodes in T
pa(r)          The parent node of node r in T
c_i(r)         The ith child node of node r in T
n_r            The number of observations at leaf node r
y_ir           The ith observation (response) at leaf node r
Y              {y_ir, i = 1, ..., n_r, r ∈ T}
x_ir           The ith observation (p-dimensional covariates) at leaf node r
X              {x_ir, i = 1, ..., n_r, r ∈ T}
β              The regression parameter vector associated with X
φ^j_r          The random effect parameter at node r at level j
φ              {φ^j_r, r ∈ T, j = 1, ..., L}
V              The residual variance of y_ir, if y_ir has a Gaussian model
γ_j            The variance of φ^j_r for all the nodes at level j
γ              {γ_1, ..., γ_L}
φ^j_{r|r}      The mean of φ^j_r | {y_ir', i = 1, ..., n_r', ∀ r' ≺ r}
σ^j_{r|r}      The variance of φ^j_r | {y_ir', i = 1, ..., n_r', ∀ r' ≺ r}
φ̂^j_r          The mean of φ^j_r | {y_ir', i = 1, ..., n_r', ∀ r' ∈ T_L}
σ^j_r          The variance of φ^j_r | {y_ir', i = 1, ..., n_r', ∀ r' ∈ T_L}

Table 1: A list of the key notations.

The fundamental assumption is that the hierarchy, determined from domain knowledge, provides a natural clustering to account for latent processes generating the data which, when incorporated into the model, improves predictions. Although MLH models are intuitive, parameter estimation presents a formidable challenge, especially for large hierarchies. For Gaussian response, the main computational bottleneck is the Cholesky factorization of a dense covariance matrix whose order depends on the number of nodes; this is expensive for large problems. For non-Gaussian response (e.g., binary data), the non-quadratic nature of the log-likelihood adds the additional challenge of approximating an integral whose dimension depends on the number of nodes in the hierarchy. This is an active area of research in statistics with several solutions being proposed, such as [5] (see references therein as well). For Cholesky factorization, techniques based on sparse factorization of the covariance matrix have been recently proposed in [5]. For non-Gaussian models, solutions require marginalization over a high dimensional integral, often accomplished through higher order Taylor series approximations [6]. However, these techniques involve linear algebra that is often non-intuitive and difficult to implement. A more natural computational scheme that exploits the structure of the model is based on Gibbs sampling; however, it is not scalable due to slow convergence.

Our contributions are as follows: We provide a novel fitting procedure based on a multi-scale Kalman filter algorithm that directly exploits the hierarchical structure of the problem and computes the posterior mode of MLH parameters. The complexity of our method is almost linear in the number of nodes in the hierarchy. Other than scalability, our fitting procedure is more intuitive and easy to implement.
We note that although multi-scale Kalman filters have been studied in the electrical engineering literature [2] and in spatial statistics, their application to fitting MLH is novel. Moreover, fitting such models to non-Gaussian data presents formidable challenges, as we illustrate in the paper. We provide strategies to overcome those through a bootstrap correction and compare with the commonly used cross-validation approach. Our methods are illustrated on simulated data, benchmark data and data obtained from an internet advertising application.

2 MLH for Gaussian Responses

Assume we have a hierarchy T consisting of L levels (the root is level 0), for which m_j, j = 0, ..., L, denotes the number of nodes at level j. Denote the set of nodes at level j in the hierarchy T as T_j. For node r in T, denote the parent of r as pa(r), and the ith child of node r as c_i(r). If a node r' is a descendant of r, we say r' ≺ r. Since the hierarchy has L levels, T_L denotes the set of leaf nodes in the hierarchy. Let y_ir, i = 1, ..., n_r, denote the ith observation at leaf node r, and x_ir denote the p-dimensional covariate vector associated with y_ir. For simplicity, we assume all observations are available at leaf nodes (a more general case where each node in the hierarchy can have observations is easily obtained from our algorithm). Consider the Gaussian MLH defined by

    y_ir | φ^L_r ~ N(x'_ir β + φ^L_r, V),    (1)

where β is a fixed effect parameter vector and φ^j_r is a random effect associated with node r at level j, with joint distribution defined through a set of hierarchical conditional distributions p(φ^j_r | φ^{j-1}_{pa(r)}), j = 0, ..., L, where φ^0_0 = 0. The form of p(φ^j_r | φ^{j-1}_{pa(r)}), j = 1, ..., L, is assumed to be

    φ^j_r | φ^{j-1}_{pa(r)} ~ N(φ^{j-1}_{pa(r)}, γ_j), j = 1, ..., L,    (2)

where γ = (γ_1, ..., γ_L) is a vector of level-specific variance components that control the amount of smoothing. To complete the model specification in a Bayesian framework, we put a vague prior on V (π(V) ∝ 1/V) and a mild quadratic prior on γ_i (π(γ_i | V) ∝ V/(V + γ_i)^2). For β, we assume a non-informative prior, i.e., π(β) ∝ 1.

The specification of MLH given by Equation 2 is referred to as the centered parametrization and was shown to provide good performance in a fully Bayesian framework by [9]. An equivalent way of specifying MLH is obtained by associating independent random variables b^j_r ~ N(0, γ_j) with the nodes and replacing φ^L_r in (1) by the sum of the b^j_r parameters along the lineage path from root to leaf node in the hierarchy. We denote this compactly as z'_r b, where b is a vector of b^j_r for all the nodes in the hierarchy, and z_r is a vector of 0/1's turned on for nodes in the path of node r. More compactly, let y = {y_ir, i = 1, ..., n_r, r ∈ T}, and let X as well as Z be the corresponding matrices of vectors x_ir and z_r for i = 1, ..., n_r and r ∈ T; then y ~ N(Xβ + Zb, V I) with b ~ N(0, Ω(γ)). The problem is to compute the posterior mode of (β_{p×1}, b_{q×1}, γ_{L×1}, V), where q = Σ_{j=1}^L m_j. The main computational bottleneck is computing the Cholesky factor of the q×q matrix (Z'Z + Ω^{-1}); this is expensive for large values of q. Existing state-of-the-art methods are based on sparse Cholesky factorization; we provide a more direct way. In fact, our method provides a MAP estimate of the parameters in the Gaussian case. For the non-Gaussian case, we provide an approximation to the MAP through the Laplace method coupled with a bootstrap correction. We also note that our method applies if the random effects are vectors and enter into equation (2) as a linear combination with some covariate vector. In this paper, we illustrate through a scalar.

2.1 Model Fitting

Throughout, we work with the parametrization specified by φ. The main component of our fitting algorithm is computing the conditional posterior distribution of φ = {φ^j_r, r ∈ T, j = 1, ..., L} given (β, V, γ). Since the parameters V and γ are unknown, we estimate them through an EM algorithm. The multi-scale Kalman filter (described next) computes the conditional posterior of φ mentioned above and is used in the inner loop of the EM.

As in temporal state space models, the Kalman filter consists of two steps: a) Filtering, where one propagates information from the leaves to the root, and b) Smoothing, where information is propagated from the root all the way down to the leaves.

Filtering:
Denote the current estimates of β, γ and V as β̂, γ̂ and V̂ respectively. Then e_ir = y_ir - x'_ir β̂ are the residuals, and Var(φ^j_r) = Σ_j = Σ_{i=1}^j γ̂_i, r ∈ T_j, are the marginal variances of the random effects. If the conditional posterior distribution is φ^L_r | {y_ir, i = 1, ..., n_r} ~ N(φ^L_{r|r}, σ^L_{r|r}), the first step is to update φ^L_{r|r} and σ^L_{r|r} for all leaf random effects φ^L_r using the standard Bayesian update formula for Gaussian models:

    φ^L_{r|r} = Σ_L Σ_{i=1}^{n_r} e_ir / (V̂ + n_r Σ_L),    (3)

    σ^L_{r|r} = Σ_L V̂ / (V̂ + n_r Σ_L).    (4)

Next, the posteriors φ^j_r | {y_ir', i = 1, ..., n_r', ∀ r' ≺ r} ~ N(φ^j_{r|r}, σ^j_{r|r}) are recursively updated from j = L - 1 to j = 1, by regressing the parent node effect towards each child and combining information from all the children.

To provide intuition about the regression step, it is useful to invert the state equation (2) and express the distribution of φ^{j-1}_{pa(r)} conditional on φ^j_r. Note that

    φ^{j-1}_{pa(r)} = E(φ^{j-1}_{pa(r)} | φ^j_r) + (φ^{j-1}_{pa(r)} - E(φ^{j-1}_{pa(r)} | φ^j_r)),    (5)

    φ^{j-1}_{pa(r)} = B_j φ^j_r + ψ^j_r,    (6)

where B_j = Σ_{i=1}^{j-1} γ̂_i / Σ_{i=1}^{j} γ̂_i is the correlation between any two siblings at level j, and ψ^j_r ~ N(0, B_j γ̂_j). Simple algebra provides the conditional expectation and variance of φ^{j-1}_{pa(r)} given φ^j_r. First, a new prior is obtained for the parent node based on the current estimate of each child, by plugging the current estimates of a child into equation (6).
For the ith child of node r (here we assume that r is at level j - 1, and c_i(r) is at level j),

    φ^{j-1}_{r|c_i(r)} = B_j φ^j_{c_i(r)|c_i(r)},    (7)

    σ^{j-1}_{r|c_i(r)} = B_j^2 σ^j_{c_i(r)|c_i(r)} + B_j γ̂_j.    (8)

Next, we combine information obtained by the parent from all its children:

    φ^{j-1}_{r|r} = σ^{j-1}_{r|r} Σ_{i=1}^{k_r} (φ^{j-1}_{r|c_i(r)} / σ^{j-1}_{r|c_i(r)}),    (9)

    1/σ^{j-1}_{r|r} = Σ_{j-1}^{-1} + Σ_{i=1}^{k_r} ((1/σ^{j-1}_{r|c_i(r)}) - Σ_{j-1}^{-1}),    (10)

where k_r is the number of children of node r at level j - 1.

Smoothing:
In the smoothing step, parents propagate information recursively from the root to the leaves to provide us with the posterior of each φ^j_r based on the entire data. Denoting the posterior mean and variance of φ^j_r given all the observations by φ̂^j_r and σ^j_r respectively, the update equations are given below. For level 1 nodes, set φ̂^1_r = φ^1_{r|r} and σ^1_r = σ^1_{r|r}. For node r at other levels,

    φ̂^j_r = φ^j_{r|r} + σ^j_{r|r} B_j (φ̂^{j-1}_{pa(r)} - φ^{j-1}_{pa(r)|r}) / σ^{j-1}_{pa(r)|r},    (11)

    σ^j_r = σ^j_{r|r} + (σ^j_{r|r})^2 B_j^2 (σ^{j-1}_{pa(r)} - σ^{j-1}_{pa(r)|r}) / (σ^{j-1}_{pa(r)|r})^2,    (12)

and let

    σ^{j,j-1}_{r,pa(r)} = σ^j_{r|r} B_j σ^{j-1}_{pa(r)} / σ^{j-1}_{pa(r)|r}.    (13)

The computational complexity of the algorithm is linear in the number of nodes in the hierarchy, and for each parent node we perform an operation which is cubic in the number of children. Hence, for most hierarchies that arise in practical applications, the complexity is "essentially" linear in the number of nodes.

Expectation Maximization:
To estimate all parameters simultaneously, we use an EM algorithm which assumes the φ parameters to be the missing latent variables. The expectation step consists of computing the expected value of the complete log-posterior with respect to the conditional distribution of the missing data φ, obtained using the multi-scale Kalman filter algorithm. The maximization step obtains revised estimates of the other parameters by maximizing the expected complete log-posterior:

    V̂ = Σ_{r∈T_L} [ Σ_{i=1}^{n_r} (e_ir - φ̂^L_r)^2 + n_r σ^L_r ] / Σ_{r∈T_L} n_r,    (14)

and for j = 1, ..., L,

    γ̂_j = Σ_{r∈T_j} ( σ^j_r + σ^{j-1}_{pa(r)} - 2σ^{j,j-1}_{r,pa(r)} + (φ̂^j_r - φ̂^{j-1}_{pa(r)})^2 ) / m_j.    (15)

Updating β̂:
We use the posterior mean of φ obtained from the Kalman filtering step to compute the posterior mean of β as given in equation (16):

    β̂ = (X'X)^{-1} X'(Y - φ̂^L),    (16)

where φ̂^L is the vector of φ̂^L_r corresponding to each observation y_ir at different leaf nodes r.

2.2 Simulation Performance

We first perform a simulation study with a hierarchy described in [7, 8]. The data focus on 2449 Guatemalan children who belong to 1558 families, who in turn live in 161 communities. The response variable of interest is binary, with a positive label assigned to a child if he/she received a full set of immunizations. The actual data contains 15 covariates capturing individual, family and community level characteristics, as shown in Table 2.
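To make the two passes concrete, the sketch below implements the filtering and smoothing recursions, equations (3)-(12), for scalar random effects in Python/NumPy. It is a minimal illustration under our own naming assumptions (the `Node` class, `multiscale_kf`, and the dictionary-based bookkeeping are ours, not the authors' code); in the full fitting procedure this posterior computation would run inside each EM iteration.

```python
import numpy as np

class Node:
    """A node in the hierarchy T; leaves (level L) hold residuals e_ir."""
    def __init__(self, level, parent=None):
        self.level, self.parent, self.children = level, parent, []
        self.e = np.empty(0)            # residuals y_ir - x'_ir beta_hat (leaves only)
        if parent is not None:
            parent.children.append(self)

def multiscale_kf(root, gamma, V):
    """Smoothed posterior (mean, var) of phi^j_r for every non-root node,
    given current estimates gamma = (gamma_1, ..., gamma_L) and V."""
    L = len(gamma)
    Sigma = np.concatenate(([0.0], np.cumsum(gamma)))   # Sigma_j = sum_{i<=j} gamma_i
    B = [0.0] + [Sigma[j - 1] / Sigma[j] for j in range(1, L + 1)]

    filt, pred = {}, {}   # filtered (mean, var); parent prediction from each child

    def filter_up(node):                 # leaves-to-root pass
        j = node.level
        if not node.children:            # leaf update, eqs (3)-(4)
            n = len(node.e)
            m = Sigma[L] * node.e.sum() / (V + n * Sigma[L])
            v = Sigma[L] * V / (V + n * Sigma[L])
        else:                            # internal node at level j >= 1
            prec, wsum = 1.0 / Sigma[j], 0.0
            for c in node.children:      # children live at level j + 1
                filter_up(c)
                mc, vc = filt[c]
                mp = B[j + 1] * mc                             # eq (7)
                vp = B[j + 1] ** 2 * vc + B[j + 1] * gamma[j]  # eq (8)
                pred[c] = (mp, vp)
                prec += 1.0 / vp - 1.0 / Sigma[j]              # eq (10)
                wsum += mp / vp
            v = 1.0 / prec
            m = v * wsum                                       # eq (9)
        filt[node] = (m, v)

    post = {}

    def smooth_down(node):               # root-to-leaves pass
        j = node.level
        m, v = filt[node]
        if j > 1:                        # level-1 nodes keep their filtered posterior
            mp_all, vp_all = post[node.parent]   # parent given all data
            mp_r, vp_r = pred[node]              # parent given this subtree only
            m = m + v * B[j] * (mp_all - mp_r) / vp_r                   # eq (11)
            v = v + v ** 2 * B[j] ** 2 * (vp_all - vp_r) / vp_r ** 2    # eq (12)
        post[node] = (m, v)
        for c in node.children:
            smooth_down(c)

    for c in root.children:              # the root effect is fixed at 0
        filter_up(c)
        smooth_down(c)
    return post
```

On small trees the output can be checked against the exact Gaussian posterior obtained by brute-force linear algebra on the joint covariance of φ, which is a convenient way to validate an implementation of the recursions.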
For our simulation study, we consider only three covariates, with the coefficient vector β set with entries all equal to 1. We simulated Gaussian responses as follows: y_ir | b ~ N(x'_ir β + b^1_r + b^2_r, 10), where b^1_r ~ N(0, 4) and b^2_r ~ N(0, 1). We simulated 100 data sets and compared the estimates from the Kalman filter to those obtained from the standard routine lme4 in the statistical software R. Results from our procedure agreed almost exactly with those obtained from lme4, and our computation was many times faster than lme4. The EM method converged rapidly and required at most 30 iterations.

3 MLH for Non-Gaussian Responses

We discuss model fitting for a Bernoulli response, but note that other distributions in the generalized linear model family can be easily fitted using the procedure. Let y_ir ~ Bernoulli(p_ir), i.e., P(y_ir) = p_ir^{y_ir} (1 - p_ir)^{1 - y_ir}. Let θ_ir = log(p_ir/(1 - p_ir)) be the log-odds. The MLH logistic regression is defined as

    θ_ir = x'_ir β + φ^L_r,    (17)

with the same multi-level prior as described in equation (2). The non-conjugacy of the normal multi-level prior makes the computation more difficult. We take recourse to a Taylor series approximation coupled with the Kalman filter algorithm. The estimates obtained are biased; we recommend cross-validation and a parametric bootstrap (adapted from [4]) to correct for the bias. The bootstrap procedure, though expensive, is easily parallelizable and accurate.

3.1 Approximation Methods

Let η_ir = x'_ir β̂ + φ̂^L_r, where β̂ and φ̂^L_r are the current estimates of the parameters in our algorithm. We do a quadratic approximation of the log-likelihood through a second order Taylor expansion (Laplace approximation) around η_ir.
This enables us to do the calculations as in the Gaussian case, with the response y_ir replaced by Z_ir, where

    Z_ir = η_ir + (2y_ir - 1) / g((2y_ir - 1) η_ir),    (18)

and g(x) = 1/(1 + exp(-x)). Approximately,

    Z_ir ~ N(x'_ir β + φ^L_r, 1/(g(η_ir) g(-η_ir))).    (19)

Now denote e_ir = Z_ir - x'_ir β̂, and the approximated variance of Z_ir as V_ir. Analogous to equations (3) and (4), the resulting filtering step for the leaf nodes becomes:

    φ^L_{r|r} = σ^L_{r|r} Σ_{i=1}^{n_r} e_ir / V_ir,    (20)

    σ^L_{r|r} = ( 1/Σ_L + Σ_{i=1}^{n_r} 1/V_ir )^{-1}.    (21)

The step for estimating β becomes:

    β̂ = (X'WX)^{-1} X'W(Z - φ̂^L),    (22)

where W = diag(1/V_ir). All the other computational steps remain the same as in the Gaussian case.

Algorithm 1 The bootstrap procedure
  Let θ = (β, γ).
  Obtain θ̃ as an initial estimate of θ. Bias b(0) = 0.
  for i = 0 to N - 1 do
    θ̂ = θ̃ - b(i).
    for j = 1 to M do
      Use θ̂ to simulate new data set j, by simulating φ and the corresponding Y.
      For data set j, obtain a new estimate of θ as θ̃(j).
    end for
    b(i+1) = (1/M) Σ_{j=1}^{M} θ̃(j) - θ̂.
  end for

3.2 Bias correction

Table 2 shows estimates of parameters obtained from our approximation method in the column titled KF. Compared to the unbiased estimates obtained from the slow Gibbs sampler, it is clear our estimates are biased. Our bias correction procedure is described in Algorithm 1. In general, a value of M = 50 with about 100-200 iterations worked well for us. The bias corrected estimates are reported under KF-B in Table 2. The estimates after bootstrap correction are closer to the estimates obtained from Gibbs sampling.
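Algorithm 1 is generic: it only needs a simulator and an estimator. A minimal sketch in Python/NumPy (the function and argument names are our own illustrative assumptions, not the paper's code) might look like:

```python
import numpy as np

def bootstrap_bias_correct(theta_init, simulate, estimate, n_iter=10, M=200, rng=None):
    """Parametric-bootstrap bias correction in the style of Algorithm 1 (a sketch).

    theta_init : initial (biased) estimate theta-tilde
    simulate   : simulate(theta, rng) -> a new data set generated under theta
    estimate   : estimate(data) -> a (possibly biased) estimate of theta
    """
    rng = np.random.default_rng(rng)
    theta_tilde = np.asarray(theta_init, dtype=float)
    bias = np.zeros_like(theta_tilde)                  # b(0) = 0
    for _ in range(n_iter):
        theta_hat = theta_tilde - bias                 # current corrected estimate
        # re-estimate on M data sets simulated under theta_hat
        reps = [estimate(simulate(theta_hat, rng)) for _ in range(M)]
        bias = np.mean(reps, axis=0) - theta_hat       # updated bias estimate
    return theta_tilde - bias
```

For instance, plugging in the MLE of a Gaussian variance (biased by a factor (n-1)/n) as the estimator, the corrected estimate moves from θ̃ toward θ̃ n/(n-1), the unbiased value. In the paper's setting, `simulate` would generate φ and the corresponding Y under the MLH model, and `estimate` would rerun the approximate Kalman-filter fit; each of the M replicate fits is independent, which is why the procedure parallelizes easily.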
It is also customary to estimate hyperparameters like γ using a tuning data set. To test the performance of such a strategy, we created a two-dimensional grid for (√γ_1, √γ_2) for the epidemiological Guatemalan data set, ranging over [0.1, 3] × [0.1, 3], and computed the log-likelihood on a 10% randomly sampled hold-out data set. For each point on the two-dimensional grid, we estimated the other parameters, φ and β, using our EM algorithm that does not update the value of γ. The estimates at the optimal value of γ are shown in Table 2 under KF-C. The estimates are better than KF but worse than KF-B.

Based on our findings, we recommend KF-B when computing resources are available (especially multiple processors) and running time is not a big constraint; if runtime is an issue, we recommend a grid search using a small number of points around the initial estimate.

4 Content Match Data Analysis

We analyze data from an internet advertising application where every showing of an ad on a web page (called an impression) constitutes an event. The goal is to rank ads on a given page based on click-through rates.
Building a predictive model for click-rates via features derived from pages and ads is an attractive approach. In our case, semantic features are obtained by classifying pages and ads into a large seven-level content hierarchy that is manually constructed by humans. We form a new hierarchy (a pyramid) by taking the cross product of the two hierarchies. This is used to estimate smooth click-rates of (page, ad) pairs.

Effects                                     KF      KF-B    KF-C    Gibbs
Fixed effects
Individual
  Child age ≥ 2 years                       0.99    1.77    1.18    1.84
  Mother age ≥ 25 years                    -0.09   -0.16   -0.10   -0.26
  Birth order 2-3                          -0.10   -0.18   -0.25   -0.29
  Birth order 4-6                           0.13    0.25    0.10    0.21
  Birth order ≥ 7                           0.20    0.36    0.21    0.50
Family
  Indigenous, no Spanish                   -0.05   -0.11    0.02   -0.22
  Indigenous Spanish                        0.00    0.01    0.02   -0.11
  Mother's education primary                0.22    0.44    0.32    0.48
  Mother's education secondary or better    0.23    0.44    0.27    0.46
  Husband's education primary               0.30    0.53    0.39    0.59
  Husband's education secondary or better   0.27    0.48    0.35    0.55
  Husband's education missing               0.02    0.04   -0.08    0.00
  Mother ever worked                        0.21    0.35    0.24    0.42
Community
  Rural                                    -0.50   -0.91   -0.62   -0.96
  Proportion indigenous, 1981              -0.67   -1.23   -0.89   -1.22
Random effects (standard deviations γ)
  Family                                    0.74    2.40    1.92    2.60
  Community                                 0.56    1.05    0.81    1.13

Table 2: Estimates for the binary MLH model of complete immunization (Kalman filtering results)

4.1 Training and Test Data

Although the page and ad hierarchies consist of 7 levels, classification is often done at coarser levels by the classifier. In fact, the average level at which classification took place is 3.8. To train our model, we only consider the top 3 levels of the original hierarchy. Pages and ads that are classified at coarser levels are randomly assigned to the children nodes.
Overall, the pyramid has 441, 25751 and 241292 nodes for the top 3 levels. The training data were collected by confining to a specific subset of data, which is sufficient to illustrate our methodology but in no way representative of the actual publisher traffic received by the ad-network under consideration. The training data we collected span 23 days and consist of approximately 11M binary observations with approximately 1.9M clicks. The test set consisted of 1 day's worth of data with approximately .5M observations. We randomly split the test data into 20 equal sized partitions to report our results. The covariates include the position at which an ad is shown; ranking ads on pages after adjusting for positional effects is important since the positional effects introduce strong bias in the estimates. In the training data, a large fraction of leaf nodes in the pyramid (approx 95%) have zero clicks; this provides a good motivation to fit the binary MLH on this data to get smoother estimates at leaf nodes by using information at coarser resolutions.

4.2 Results

We compare the following models using log-likelihood on the test data: a) the model which predicts a constant probability for all examples, b) 3 level MLH but without positional effects, c) top 2 level MLH to illustrate the gains of using information at a finer resolution, and d) 3 level MLH with positional effects to illustrate the generality of the approach; one can incorporate both additional features and the hierarchy into a single model. Figure 1 shows the distribution of average test likelihood on the partitions.
As expected, all variations of MLH are better than the constant model. The MLH model which uses only 2 levels is inferior to the 3 level MLH, while the general model that uses both covariates and hierarchy is the best.

[Figure 1 here: boxplots of test log-likelihood for the models 2lev, 3lev, 3lev-pos and con.]

Figure 1: Distribution of test log-likelihood on 20 equal sized splits of test data.

5 Discussion

In applications where data is aggregated at multiple resolutions with sparsity at finer resolutions, multi-level hierarchical models provide an attractive class to reduce variance by smoothing estimates at finer resolutions using data at coarser resolutions. However, the smoothing provides a better bias-variance tradeoff only when the hierarchy provides a natural clustering for the response variable and captures some latent characteristics of the process; this is often true in practice. We proposed a fast novel algorithm to fit these models based on a multi-scale Kalman filter that is both scalable and easy to implement. For the non-Gaussian case, the estimates are biased, but performance can be improved by using a bootstrap correction or estimation through a tuning set. In future work, we will report on models that generalize our approach to an arbitrary number of hierarchies that may all have different structure. This is a challenging problem since, in general, the cross-product of trees is not a hierarchy but a graph.

References

[1] D. Agarwal, A. Broder, D. Chakrabarti, D. Diklic, V. Josifovski, and M. Sayyadian. Estimating rates of rare events at multiple resolutions. In KDD, pages 16–25, 2007.

[2] K. C. Chou, A. S. Willsky, and R. Nikoukhah. Multiscale systems, Kalman filters, and Riccati equations. IEEE Transactions on Automatic Control, 39:479–492, 1994.

[3] A. Gelman and J. Hill.
Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

[4] A. Y. C. Kuk. Asymptotically unbiased estimation in generalized linear models with random effects. Journal of the Royal Statistical Society, Series B (Methodological), 57:395–407, 1995.

[5] J. C. Pinheiro and D. M. Bates. Mixed-Effects Models in S and S-PLUS. Springer-Verlag, New York, 2000.

[6] S. W. Raudenbush, M. L. Yang, and M. Yosef. Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation. Journal of Computational and Graphical Statistics, 9(1):141–157, 2000.

[7] G. Rodriguez and N. Goldman. An assessment of estimation procedures for multilevel models with binary responses. Journal of the Royal Statistical Society, Series A, 158:73–89, 1995.

[8] G. Rodriguez and N. Goldman. Improved estimation procedures for multilevel models with binary response: A case-study. Journal of the Royal Statistical Society, Series A, 164(2):339–355, 2001.

[9] S. K. Sahu and A. E. Gelfand. Identifiability, improper priors, and Gibbs sampling for generalized linear models. Journal of the American Statistical Association, 94(445):247–254, 1999.
", "award": [], "sourceid": 912, "authors": [{"given_name": "Liang", "family_name": "Zhang", "institution": null}, {"given_name": "Deepak", "family_name": "Agarwal", "institution": null}]}