{"title": "Hierarchically Supervised Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 2609, "page_last": 2617, "abstract": "We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bag-of-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.", "full_text": "Hierarchically Supervised Latent Dirichlet Allocation\n\nAdler Perotte\n\nNicholas Bartlett\n\n{ajp9009@dbmi,bartlett@stat,noemie@dbmi,fwood@stat}.columbia.edu\n\nColumbia University, New York, NY 10027, USA\n\nNo\u00b4emie Elhadad\n\nFrank Wood\n\nAbstract\n\nWe introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a\nmodel for hierarchically and multiply labeled bag-of-word data. Examples of such\ndata include web pages and their placement in directories, product descriptions\nand associated categories from product hierarchies, and free-text clinical records\nand their assigned diagnosis codes. Out-of-sample label prediction is the primary\ngoal of this work, but improved lower-dimensional representations of the bag-\nof-word data are also of interest. We demonstrate HSLDA on large-scale data\nfrom clinical document labeling and retail product categorization tasks. 
We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not.

1 Introduction

There exist many sources of unstructured data that have been partially or completely categorized by human editors. In this paper we focus on unstructured text data that has been, at least in part, manually categorized. Examples include, but are not limited to, webpages and curated hierarchical directories of the same [1], product descriptions and catalogs, and patient records and the diagnosis codes assigned to them for bookkeeping and insurance purposes. In this work we show how to combine these two sources of information using a single model that allows one to categorize new text documents automatically, suggest labels that might be inaccurate, compute improved similarities between documents for information retrieval purposes, and more. The models and techniques that we develop in this paper are applicable to other data as well, namely, any unstructured representations of data that have been hierarchically classified (e.g., image catalogs with bag-of-feature representations).

There are several challenges entailed in incorporating a hierarchy of labels into the model. Among them, given a large set of potential labels (often thousands), each instance has only a small number of labels associated with it. Furthermore, there are no naturally occurring negative labelings in the data, and the absence of a label cannot always be interpreted as a negative labeling.

Our work operates within the framework of topic modeling. Our approach learns topic models of the underlying data and labeling strategies in a joint model, while leveraging the hierarchical structure of the labels. For the sake of simplicity, we focus on "is-a" hierarchies, but the model can be applied to other structured label spaces. 
We extend supervised latent Dirichlet allocation (sLDA) [6] to take advantage of hierarchical supervision. We propose an efficient way to incorporate hierarchical information into the model. We hypothesize that the context of labels within the hierarchy provides valuable information about labeling.

We demonstrate our model on large, real-world datasets in the clinical and web retail domains. We observe that hierarchical information is valuable when incorporated into the learning and improves our primary goal of multi-label classification. Our results show that a joint, hierarchical model outperforms a classification with unstructured labels as well as a disjoint model, where the topic model and the hierarchical classification are inferred independently of each other.

Figure 1: HSLDA graphical model

The remainder of this paper is organized as follows. Section 2 introduces hierarchically supervised LDA (HSLDA), while Section 3 details a sampling approach to inference in HSLDA. Section 4 reviews related work, and Section 5 shows results from applying HSLDA to health care and web retail data.

2 Model

HSLDA is a model for hierarchically, multiply labeled, bag-of-word data. We will refer to individual groups of bag-of-word data as documents. Let wn,d ∈ Σ be the nth observation in the dth document. Let wd = {w1,d, . . . , wNd,d} be the set of Nd observations in document d. Let there be D such documents, and let the size of the vocabulary be V = |Σ|. Let the set of labels be L = {l1, l2, . . . , l|L|}. Each label l ∈ L, except the root, has a parent pa(l) ∈ L also in the set of labels. We will, for exposition purposes, assume that this label set has hard "is-a" parent-child constraints (explained later), although this assumption can be relaxed at the cost of more computationally complex inference. Such a label hierarchy forms a multiply rooted tree. 
Without loss of generality we will consider a tree with a single root r ∈ L. Each document has a variable yl,d ∈ {−1, 1} for every label, which indicates whether the label is applied to document d or not. In most cases yl,d will be unobserved; in some cases we will be able to fix its value because of constraints on the label hierarchy, and in the relatively small remainder its value will be observed. In the applications we consider, only positive labels are observed.

The constraints imposed by an is-a label hierarchy are that if the lth label is applied to document d, i.e., yl,d = 1, then all labels in the label hierarchy up to the root are also applied to document d, i.e., ypa(l),d = 1, ypa(pa(l)),d = 1, . . . , yr,d = 1. Conversely, if a label l′ is marked as not applying to a document, then no descendant of that label may be applied to the same. We assume that at least one label is applied to every document. This is illustrated in Figure 1, where the root label is always applied but only some of the descendant labelings are observed as having been applied (diagonal hashing indicates that potentially some of the plated variables are observed).

In HSLDA, documents are modeled using the LDA mixed-membership mixture model with global topic estimation. Label responses are generated using a conditional hierarchy of probit regressors. The HSLDA graphical model is given in Figure 1. In the model, K is the number of LDA "topics" (distributions over the elements of Σ), φk is a distribution over "words," θd is a document-specific distribution over topics, β is a global distribution over topics, DirK(·) is a K-dimensional Dirichlet distribution, NK(·) is the K-dimensional Normal distribution, IK is the K-dimensional identity matrix, 1d is the d-dimensional vector of all ones, and I(·) is an indicator function that takes the value 1 if its argument is true and 0 otherwise. 
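These propagation rules are easy to operationalize. The sketch below (hypothetical helper and label names, not from the paper) computes the full set of labels forced positive by the is-a constraints given the observed positive labels:

```python
# Sketch of is-a constraint propagation: a positive label forces all of its
# ancestors up to the root to be positive. Helper and label names are
# illustrative, not from the paper.

def propagate_positive(observed_positive, parent):
    """observed_positive: set of observed positive labels.
    parent: dict mapping each label to its parent (root maps to None).
    Returns every label forced positive by the is-a constraints."""
    forced = set()
    for label in observed_positive:
        while label is not None and label not in forced:
            forced.add(label)
            label = parent[label]  # walk up toward the root
    return forced

# Toy three-level hierarchy: root -> pneumonia -> viral -> adenovirus.
parent = {"root": None, "pneumonia": "root",
          "viral": "pneumonia", "adenovirus": "viral"}
print(propagate_positive({"adenovirus"}, parent))  # all four labels forced on
```

The dual rule (a negative label forces all descendants negative) would walk the tree in the opposite direction.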
The following procedure describes how to generate from the HSLDA generative model.

1. For each topic k = 1, . . . , K
   • Draw a distribution over words φk ~ DirV (γ1V)
2. For each label l ∈ L
   • Draw a label application coefficient ηl | µ, σ ~ NK (µ1K, σIK)
3. Draw the global topic proportions β | α′ ~ DirK (α′1K)
4. For each document d = 1, . . . , D
   • Draw topic proportions θd | β, α ~ DirK (αβ)
   • For n = 1, . . . , Nd
     – Draw topic assignment zn,d | θd ~ Multinomial(θd)
     – Draw word wn,d | zn,d, φ1:K ~ Multinomial(φzn,d)
   • Set yr,d = 1
   • For each label l in a breadth-first traversal of L starting at the children of root r
     – Draw the auxiliary variable
         al,d | z̄d, ηl, ypa(l),d ~ N(z̄dᵀηl, 1)                 if ypa(l),d = 1
         al,d | z̄d, ηl, ypa(l),d ~ N(z̄dᵀηl, 1) I(al,d < 0)     if ypa(l),d = −1
     – Apply label l to document d according to al,d:
         yl,d = 1 if al,d > 0, and yl,d = −1 otherwise.

Here z̄dᵀ = [z̄1, . . . , z̄k, . . . , z̄K] is the empirical topic distribution for document d, in which each entry is the percentage of the words in that document that come from topic k, z̄k = (1/Nd) Σ_{n=1}^{Nd} I(zn,d = k).

The second half of step 4 is a substantial part of our contribution to the general class of supervised LDA models. Here, each document is labeled generatively using a hierarchy of conditionally dependent probit regressors [14]. For every label l ∈ L, both the empirical topic distribution for document d and whether or not its parent label was applied (i.e., I(ypa(l),d = 1)) are used to determine whether or not label l is to be applied to document d as well. 
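For concreteness, the generative procedure above can be sketched in a few lines of Python. This is a toy simulation under assumed dimensions and a hand-built label tree; a simple rejection loop stands in for the truncated normal draw, and none of the specific values come from the paper.

```python
import numpy as np

# Toy simulation of the HSLDA generative process for a single document.
rng = np.random.default_rng(0)
K, V, Nd = 4, 20, 50                  # topics, vocabulary size, document length
gamma, alpha, alpha_prime = 1.0, 1.0, 1.0
mu, sigma = -1.0, 1.0                 # prior mean/variance of the regressors

labels = ["root", "A", "B", "A1"]     # breadth-first order of a toy is-a tree
parent = {"A": "root", "B": "root", "A1": "A"}

phi = rng.dirichlet(gamma * np.ones(V), size=K)     # step 1: topics
eta = {l: rng.normal(mu, np.sqrt(sigma), size=K) for l in labels}  # step 2
beta = rng.dirichlet(alpha_prime * np.ones(K))      # step 3: global proportions

theta = rng.dirichlet(alpha * beta)                 # step 4: document topics
z = rng.choice(K, size=Nd, p=theta)                 # topic assignments
w = np.array([rng.choice(V, p=phi[k]) for k in z])  # words
zbar = np.bincount(z, minlength=K) / Nd             # empirical topic distribution

y = {"root": 1}
for l in labels[1:]:
    a = rng.normal(zbar @ eta[l], 1.0)              # probit auxiliary variable
    if y[parent[l]] == -1:                          # parent off: truncate a < 0
        while a >= 0:
            a = rng.normal(zbar @ eta[l], 1.0)      # crude rejection sketch
    y[l] = 1 if a > 0 else -1
```

By construction, `y` satisfies the is-a constraints: a label can only be positive if its parent is.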
Note that label yl,d can only be applied to\ndocument d if its parent label pa(l) is also applied (these expressions are speci\ufb01c to is-a constraints\nbut can be modi\ufb01ed to accommodate different constraints). The regression coef\ufb01cients \u03b7l are inde-\npendent a priori, however, the hierarchical coupling in this model induces a posteriori dependence.\nThe net effect of this is that label predictors deeper in the label hierarchy are able to focus on \ufb01nding\nspeci\ufb01c, conditional labeling features. We believe this to be a signi\ufb01cant source of the empirical\nlabel prediction improvement we observe experimentally. We test this hypothesis in Section 5.\nNote that the choice of variables al,d and how they are distributed were driven at least in part by\nposterior inference ef\ufb01ciency considerations. In particular, choosing probit-style auxiliary variable\ndistributions for the al,d\u2019s yields conditional posterior distributions for both the auxiliary variables\n(3) and the regression coef\ufb01cients (2) which are analytic. This simpli\ufb01es posterior inference sub-\nstantially.\nIn the common case where no negative labels are observed (like the example applications we con-\nsider in Section 5), the model must be explicitly biased towards generating data that has negative\nlabels in order to keep it from learning to assign all labels to all documents. This is a common\nproblem in modeling unbalanced data. To see how this model can be biased in this way we draw\nthe reader\u2019s attention to the \u00b5 parameter and, to a lesser extent, the \u03c3 parameter above. Because \u00afzd\nis always positive, setting \u00b5 to a negative value results in a bias towards negative labelings, i.e. for\nlarge negative values of \u00b5, all labels become a priori more likely to be negative (yl,d = \u22121). 
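A quick back-of-the-envelope check (our own, not from the paper) makes this bias concrete: marginalizing ηl ~ NK(µ1K, σIK) and ignoring the hierarchical truncation, z̄dᵀηl ~ N(µ, σ‖z̄d‖²) because the entries of z̄d sum to one, so the prior probability of a positive label is Φ(µ/√(1 + σ‖z̄d‖²)), which shrinks as µ becomes more negative:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prior_positive_prob(mu, sigma, zbar_sq_norm):
    """Prior P(y = 1), marginalizing eta ~ N(mu*1, sigma*I) and ignoring the
    hierarchical truncation: a ~ N(mu, 1 + sigma*||zbar||^2), so
    P(a > 0) = Phi(mu / sqrt(1 + sigma*||zbar||^2))."""
    return normal_cdf(mu / sqrt(1.0 + sigma * zbar_sq_norm))

# More negative mu -> labels a priori less likely to be positive.
for m in (0.0, -1.0, -3.0):
    print(m, round(prior_positive_prob(m, 1.0, 0.25), 4))
```

At µ = 0 the prior probability is exactly 0.5, and it decays monotonically as µ is pushed negative, which is the biasing effect described above.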
We explore the ability of µ to bias out-of-sample label prediction performance in Section 5.

3 Inference

In this section we provide the conditional distributions required to draw samples from the HSLDA posterior distribution using Gibbs sampling, a Markov chain Monte Carlo technique. Note that, as in collapsed Gibbs samplers for LDA [16], we have analytically marginalized out the parameters φ1:K and θ1:D in the following expressions. Let a be the set of all auxiliary variables, w the set of all words, η the set of all regression coefficients, and z\zn,d the set z with element zn,d removed. The conditional posterior distribution of the latent topic indicators is

p(zn,d = k | z\zn,d, a, w, η, α, β, γ) ∝ [ (c^{k,−(n,d)}_{wn,d,(·)} + γ) / (c^{k,−(n,d)}_{(·),(·)} + V γ) ] (c^{k,−(n,d)}_{(·),d} + αβk) ∏_{l∈Ld} exp{ −(z̄dᵀηl − al,d)² / 2 }    (1)

where c^{k,−(n,d)}_{v,d} is the number of words of type v in document d assigned to topic k, omitting the nth word of document d. A subscript (·) indicates a sum over the range of the replaced variable, e.g., c^{k,−(n,d)}_{wn,d,(·)} = Σ_d c^{k,−(n,d)}_{wn,d,d}. Here Ld is the set of labels which are observed for document d.

The conditional posterior distribution of the regression coefficients is given by

p(ηl | z, a, σ) = N(µ̂l, Σ̂)    (2)

where

µ̂l = Σ̂ ( (µ/σ)1K + Z̄ᵀal ),    Σ̂⁻¹ = σ⁻¹IK + Z̄ᵀZ̄.

Here Z̄ is a D × K matrix such that row d of Z̄ is z̄d, and al = [al,1, al,2, . . . , al,D]ᵀ. The simplicity of this conditional distribution follows from the choice of probit regression [4]; the specific form of the update is a standard result from Bayesian normal linear regression [14]. It is also a standard probit regression result that the conditional posterior distribution of al,d is a truncated normal distribution [4]:

p(al,d | z, Y, η) ∝ exp{ −(al,d − ηlᵀz̄d)²/2 } I(al,d yl,d > 0) I(al,d < 0),   ypa(l),d = −1
p(al,d | z, Y, η) ∝ exp{ −(al,d − ηlᵀz̄d)²/2 } I(al,d yl,d > 0),               ypa(l),d = 1    (3)

Note that care must be taken to initialize the Gibbs sampler in a valid state.

HSLDA employs a hierarchical Dirichlet prior over topic assignments (i.e., β is estimated from data rather than fixed a priori). This has been shown to improve the quality and stability of inferred topics [26]. Sampling β is done using the "direct assignment" method of Teh et al. [25]:

β | z, α′, α ~ Dir( m(·),1 + α′, m(·),2 + α′, . . . , m(·),K + α′ )    (4)

Here md,k are auxiliary variables that are required to sample the posterior distribution of β. Their conditional posterior distribution is sampled according to

p(md,k = m | z, m−(d,k), β) = [ Γ(αβk) / Γ(αβk + c^k_{(·),d}) ] s( c^k_{(·),d}, m ) (αβk)^m    (5)

where s(n, m) represents Stirling numbers of the first kind.

The hyperparameters α, α′, and γ are sampled using Metropolis-Hastings.

4 Related Work

In this work we extend supervised latent Dirichlet allocation (sLDA) [6] to take advantage of hierarchical supervision. 
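To make the updates concrete, the following sketch draws ηl from its conditional (2) and al,d from the truncated normal (3), assuming the empirical topic matrix Z̄ and the auxiliary variables are given (synthetic values here; a simple rejection loop stands in for a proper truncated normal sampler):

```python
import numpy as np

# Conditional updates (2) and (3) on synthetic inputs.
rng = np.random.default_rng(1)
D, K = 100, 5
mu, sigma = -1.0, 1.0
Zbar = rng.dirichlet(np.ones(K), size=D)    # D x K empirical topic distributions
a_l = rng.normal(mu, 1.0, size=D)           # auxiliary variables for one label

# Eq. (2): eta_l | z, a, sigma ~ N(mu_hat, Sigma_hat)
Sigma_inv = np.eye(K) / sigma + Zbar.T @ Zbar
Sigma_hat = np.linalg.inv(Sigma_inv)
mu_hat = Sigma_hat @ (mu / sigma + Zbar.T @ a_l)   # (mu/sigma)*1_K broadcasts
eta_l = rng.multivariate_normal(mu_hat, Sigma_hat)

# Eq. (3): a_{l,d} is N(eta_l^T zbar_d, 1) truncated to the side that agrees
# with y_{l,d}; a rejection loop is a simple stand-in for truncated sampling.
def sample_a(mean, y_ld, rng):
    a = rng.normal(mean, 1.0)
    while a * y_ld <= 0:
        a = rng.normal(mean, 1.0)
    return a
```

In a full sampler these two steps would alternate with the topic update (1), each label's regressor seeing only the documents permitted by its parent's labeling.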
sLDA is latent Dirichlet allocation (LDA) [7] augmented with per-document "supervision," often taking the form of a single numerical or categorical label. It has been demonstrated that the signal provided by such supervision can result in better, task-specific document models and can also lead to good label prediction for out-of-sample data [6]. sLDA has also been shown to outperform both LASSO (L1-regularized least squares regression) and LDA followed by least squares regression [6]. sLDA can be applied to data of the type we consider in this paper; however, doing so requires ignoring the hierarchical dependencies amongst the labels. In Section 5 we contrast HSLDA with sLDA applied in this way.

Other models that incorporate LDA and supervision include LabeledLDA [23] and DiscLDA [18]. Various applications of these models to computer vision and document networks have been explored [27, 9]. None of these models, however, leverages dependency structure in the label space.

In other work, researchers have classified documents into a hierarchy (a closely related task) with naive Bayes classifiers and support vector machines. Most of this work has been demonstrated on relatively small datasets and small label spaces, and has focused on single-label classification without a model of documents such as LDA [21, 11, 17, 8].

5 Experiments

We applied HSLDA to data from two domains: predicting medical diagnosis codes from hospital discharge summaries and predicting product categories from Amazon.com product descriptions.

5.1 Data and Pre-Processing

5.1.1 Discharge Summaries and ICD-9 Codes

Discharge summaries are authored by clinicians to summarize patient hospitalization courses. The summaries typically contain a record of patient complaints, findings and diagnoses, along with treatment and hospital course. 
For each hospitalization, trained medical coders review the information in the discharge summary and assign a series of diagnosis codes. Coding follows the ICD-9-CM controlled terminology, an international diagnostic classification for epidemiological, health management, and clinical purposes.1 The ICD-9 codes are organized in a rooted-tree structure, with each edge representing an is-a relationship between parent and child, such that the parent diagnosis subsumes the child diagnosis. For example, the code for "Pneumonia due to adenovirus" is a child of the code for "Viral pneumonia," where the former is a type of the latter. It is worth noting that the coding can be noisy: human coders sometimes disagree [3], tend to be more specific than sensitive in their assignments [5], and sometimes make mistakes [13].

The task of automatic ICD-9 coding has been investigated in the clinical domain. Methods range from manual rules to online learning [10, 15, 12]. Other work has leveraged larger datasets and experimented with K-nearest neighbor, naive Bayes, support vector machines, Bayesian ridge regression, as well as simple keyword mappings, all with promising results [19, 24, 22, 20].

Our dataset was gathered from the NewYork-Presbyterian Hospital clinical data warehouse. It consists of 6,000 discharge summaries and their associated ICD-9 codes (7,298 distinct codes overall), representing all the discharges from the hospital in 2009. All included discharge summaries had associated ICD-9 codes. Summaries have 8.39 associated ICD-9 codes on average (std dev=5.01) and contain an average of 536.57 terms after preprocessing (std dev=300.29). 
We split our dataset into 5,000 discharge summaries for training and 1,000 for testing.

The text of the discharge summaries was tokenized with NLTK.2 A fixed vocabulary was formed by taking the top 10,000 tokens with the highest document frequency (exclusive of names, places, and other identifying numbers). The study was approved by the Institutional Review Board and follows HIPAA (Health Insurance Portability and Accountability Act) privacy guidelines.

5.1.2 Product Descriptions and Categorizations

Amazon.com, an online retail store, organizes its catalog of products in a multiply rooted hierarchy and provides textual product descriptions for most products. Products can be discovered by users through free-text search and product category exploration. Top-level product categories are displayed on the front page of the website, and lower-level categories can be discovered by choosing one of the top-level categories. Products can exist in multiple locations in the hierarchy.

In this experiment, we obtained Amazon.com product categorization data from the Stanford Network Analysis Platform (SNAP) dataset [2]. Product descriptions were obtained separately from the Amazon.com website directly. We limited our dataset to the collection of DVDs in the product catalog.

Our dataset contains 15,130 product descriptions for training and 1,000 for testing. The product descriptions are shorter than the discharge summaries (91.89 terms on average, std dev=53.08).

1 http://www.cdc.gov/nchs/icd/icd9cm.htm
2 http://www.nltk.org

Overall, there are 2,691 unique categories. Products are assigned 9.01 categories on average (std dev=4.91). The vocabulary consists of the most frequent 30,000 words, omitting stopwords.

5.2 Comparison Models

We evaluated HSLDA along with two closely related models against the two datasets. 
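The document-frequency vocabulary cutoff used for both corpora can be sketched generically as follows (the filtering of names and identifying numbers described above is omitted):

```python
from collections import Counter

def build_vocab(tokenized_docs, n_top):
    """Keep the n_top tokens with the highest document frequency."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))        # count each token once per document
    return [token for token, _ in df.most_common(n_top)]

# Tiny illustration: "fever" appears in 3 documents, "cough" and "rash" in 1.
docs = [["fever", "cough", "cough"], ["fever", "rash"], ["fever"]]
print(build_vocab(docs, 2))
```

Counting each token once per document (via `set(doc)`) is what makes this a document-frequency rather than a raw term-frequency cutoff.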
The compari-\nson models included sLDA with independent regressors (hierarchical constraints on labels ignored)\nand HSLDA \ufb01t by \ufb01rst performing LDA then \ufb01tting tree-conditional regressions. These models were\nchosen to highlight several aspects of HSLDA including performance in the absence of hierarchical\nconstraints, the effect of the combined inference, and regression performance attributable solely to\nthe hierarchical constraints.\nsLDA with independent regressors is the most salient comparison model for our work. The distin-\nguishing factor between HSLDA and sLDA is the additional structure imposed on the label space, a\ndistinction that we hypothesized would result in a difference in predictive performance.\nThere are two components to HSLDA, LDA and a hierarchically constrained response. The second\ncomparison model is HSLDA \ufb01t by performing LDA \ufb01rst followed by performing inference over the\nhierarchically constrained label space. In this comparison model, the separate inference processes\ndo not allow the responses to in\ufb02uence the low dimensional structure inferred by LDA. Combined\ninference has been shown to improve performance in sLDA [6]. This comparison model examines\nnot the structuring of the label space, but the bene\ufb01t of combined inference over both the documents\nand the label space.\nFor all three models, particular attention was given to the settings of the prior parameters for the\nregression coef\ufb01cients. These parameters implement an important form of regularization in HSLDA.\nIn the setting where there are no negative labels, a Gaussian prior over the regression parameters\nwith a negative mean implements a prior belief that missing labels are likely to be negative. Thus,\nwe evaluated model performance for all three models with a range of values for \u00b5, the mean prior\nparameter for regression coef\ufb01cients (\u00b5 \u2208 {\u22123,\u22122.8,\u22122.6, . . . 
, 1}).

The number of topics for all models was set to 50, and the prior distributions p(α), p(α′), and p(γ) were gamma distributions with a shape parameter of 1 and a scale parameter of 1000.

(a) Clinical data performance.    (b) Retail product performance.

Figure 2: ROC curves for out-of-sample label prediction varying µ, the prior mean of the regression parameters. In both figures, solid is HSLDA, dashed is sLDA with independent regressors (hierarchical constraints on labels ignored), and dotted is HSLDA fit by running LDA first then running tree-conditional regressions.

5.3 Evaluation and Results

We evaluated our model, HSLDA, against the comparison models with a focus on predictive performance on held-out data. Prediction performance was measured with standard metrics – sensitivity (true positive rate) and 1-specificity (false positive rate).

Figure 3: ROC curve for out-of-sample ICD-9 code prediction varying the auxiliary variable threshold. µ = −1.0 for all three models in this figure.

The gold standard for comparison was derived from the testing set in each dataset. To make the comparison as fair as possible among models, ancestors of observed nodes in the label hierarchy were ignored, observed nodes were considered positive, and descendants of observed nodes were considered to be negative. Note that this is different from our treatment of the observations during inference. Since the sLDA model does not enforce the hierarchical constraints, we establish a more equal footing by considering only the observed labels as being positive, despite the fact that, following the hierarchical constraints, ancestors must also be positive. 
Such a gold standard will likely inflate the number of false positives because the labels applied to any particular document are usually not as complete as they could be. ICD-9 codes, for instance, lack sensitivity, and their use as a gold standard could lead to correctly positive predictions being labeled as false positives [5]. However, given that the label space is often large (as in our examples), it is a moderate assumption that erroneous false positives should not skew results significantly.

Predictive performance in HSLDA is evaluated by p( yl,d̂ | w1:Nd̂,d̂, w1:Nd,1:D, yl∈L,1:D ) for each test document d̂. For efficiency, the expectation of this probability distribution was estimated in the following way. Expectations of z̄d̂ and ηl were estimated with samples from the posterior. Using these expectations, we performed Gibbs sampling over the hierarchy to acquire predictive samples for the documents in the test set. The true positive rate was calculated as the average expected labeling for gold-standard positive labels. The false positive rate was calculated as the average expected labeling for gold-standard negative labels.

As sensitivity and specificity can always be traded off, we examined sensitivity for a range of values of two different parameters – the prior mean of the regression coefficients and the threshold for the auxiliary variables. The goal in this analysis was to evaluate the performance of these models subject to more or less stringent requirements for predicting positive labels. These two parameters have important, related functions in the model. The prior mean and the auxiliary variable threshold together encode the strength of the prior belief that unobserved labels are likely to be negative. 
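The rate computations reduce to averaging expected labelings over the gold standard; a minimal sketch for one test document (hypothetical inputs, not the paper's data):

```python
def tpr_fpr(expected_positive, gold):
    """expected_positive: dict label -> estimated P(y = 1) for one test document.
    gold: dict label -> +1 / -1 gold-standard assignment.
    Returns (true positive rate, false positive rate) as the average expected
    labeling over gold positives and gold negatives, respectively."""
    pos = [expected_positive[l] for l, g in gold.items() if g == 1]
    neg = [expected_positive[l] for l, g in gold.items() if g == -1]
    return sum(pos) / len(pos), sum(neg) / len(neg)

probs = {"A": 0.9, "B": 0.2, "C": 0.4}
gold = {"A": 1, "B": -1, "C": -1}
tpr, fpr = tpr_fpr(probs, gold)
```

Averaging these per-document rates over the test set yields one point on the ROC curves; sweeping µ or the auxiliary variable threshold traces out the full curve.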
Effectively, the prior mean applies negative pressure to the predictions and the auxiliary variable threshold determines the cutoff. For each model type, separate models were fit for each value of the prior mean of the regression coefficients. This is a proper Bayesian sensitivity analysis. In contrast, to evaluate predictive performance as a function of the auxiliary variable threshold, a single model was fit for each model type and prediction was evaluated based on predictive samples drawn subject to different auxiliary variable thresholds. These methods are significantly different, since the prior mean is varied prior to inference, while the auxiliary variable threshold is varied following inference.

Figure 2(a) demonstrates the performance of the model on the clinical data as an ROC curve varying µ. For instance, a hyperparameter setting of µ = −1.6 yields the following performance: the full HSLDA model had a true positive rate of 0.57 and a false positive rate of 0.13, the sLDA model had a true positive rate of 0.42 and a false positive rate of 0.07, and the HSLDA model where LDA and the regressions were fit separately had a true positive rate of 0.39 and a false positive rate of 0.08. These points are highlighted in Figure 2(a).

These results indicate that the full HSLDA model predicts more of the correct labels at the cost of an increase in the number of false positives relative to the comparison models.

Figure 2(b) demonstrates the performance of the model on the retail product data as an ROC curve, also varying µ. 
For instance, a hyperparameter setting of µ = −2.2 yields the following performance: the full HSLDA model had a true positive rate of 0.85 and a false positive rate of 0.30, the sLDA model had a true positive rate of 0.78 and a false positive rate of 0.14, and the HSLDA model where LDA and the regressions were fit separately had a true positive rate of 0.77 and a false positive rate of 0.16. These results follow a similar pattern to the clinical data. These points are highlighted in Figure 2(b).

Figure 3 shows the predictive performance of HSLDA relative to the two comparison models on the clinical dataset as a function of the auxiliary variable threshold. For low values of the auxiliary variable threshold, the models predict labels in a more sensitive and less specific manner, creating the points in the upper right corner of the ROC curve. As the auxiliary variable threshold is increased, the models predict in a less sensitive and more specific manner, creating the points in the lower left corner of the ROC curve. HSLDA with full joint inference outperforms sLDA with independent regressors as well as HSLDA with separately trained regressions.

6 Discussion

The sLDA model family, of which HSLDA is a member, can be understood in two different ways. One way is to see it as a family of topic models that improve on the topic modeling performance of LDA via the inclusion of observed supervision. An alternative, complementary way is to see it as a set of models that can predict labels for bag-of-word data. A large diversity of problems can be expressed as label prediction problems for bag-of-word data, and a surprisingly large amount of such data possesses structured labels, either hierarchically constrained or otherwise. That HSLDA directly addresses this kind of data is a large part of the motivation for this work. 
That it outperforms more straightforward approaches should be of interest to practitioners.

Variational Bayes has been the predominant estimation approach applied to sLDA models. Hierarchical probit regression makes for tractable Markov chain Monte Carlo sLDA inference, a benefit that should extend to other sLDA models should probit regression be used for response variable prediction there too.

The results in Figures 2(a) and 2(b) suggest that in most cases it is better to do full joint estimation of HSLDA. An alternative interpretation of the same results is that, if one is most interested in the performance gains that result from exploiting the structure of the labels, then one can, in an engineering sense, get nearly as much gain in label prediction performance by first fitting LDA and then fitting a hierarchical probit regression. There are applied settings in which this could be advantageous.

Extensions to this work include unbounded topic cardinality variants and relaxations to different kinds of label structure. Unbounded topic cardinality variants pose interesting inference challenges. Utilizing different kinds of label structure is possible within this framework, but requires relaxing some of the simplifications we made in this paper for expositional purposes.

References

[1] DMOZ open directory project. http://www.dmoz.org/, 2002.

[2] Stanford network analysis platform. http://snap.stanford.edu/, 2004.

[3] The Computational Medicine Center's 2007 medical natural language processing challenge. http://www.computationalmedicine.org/challenge/previous, 2007.

[4] J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669, 1993.

[5] E. Birman-Deych, A. D. Waterman, Y. Yan, D. S. Nilasena, M. J. Radford, and B. F. Gage. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors. 
Medical Care, 43(5):480–485, 2005.

[6] D. Blei and J. McAuliffe. Supervised topic models. Advances in Neural Information Processing Systems, 20:121–128, 2008.

[7] D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003. ISSN 1532-4435.

[8] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal, 7:163–178, August 1998. ISSN 1066-8888.

[9] J. Chang and D. M. Blei. Hierarchical relational models for document networks. Annals of Applied Statistics, 4:124–150, 2010. doi: 10.1214/09-AOAS309.

[10] K. Crammer, M. Dredze, K. Ganchev, P. P. Talukdar, and S. Carroll. Automatic code assignment to medical text. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pages 129–136, 2007.

[11] S. Dumais and H. Chen. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '00, pages 256–263, New York, NY, USA, 2000. ACM.

[12] R. Farkas and G. Szarvas. Automatic construction of rule-based ICD-9-CM coding systems. BMC Bioinformatics, 9(Suppl 3):S10, 2008.

[13] M. Farzandipour, A. Sheikhtaheri, and F. Sadoughi. Effective factors on accuracy of principal diagnosis coding based on International Classification of Diseases, the 10th revision. International Journal of Information Management, 30:78–84, 2010.

[14] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.

[15] I. Goldstein, A. Arzumtsyan, and Ö. Uzuner. Three approaches to automatic assignment of ICD-9-CM codes to radiology reports.
AMIA Annual Symposium Proceedings, 2007:279, 2007.

[16] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, 2004.

[17] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. Technical Report 1997-75, Stanford InfoLab, February 1997. Previous number = SIDL-WP-1997-0059.

[18] S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Neural Information Processing Systems, pages 897–904, 2008.

[19] L. Larkey and B. Croft. Automatic assignment of ICD9 codes to discharge summaries. Technical report, University of Massachusetts, 1995.

[20] L. V. Lita, S. Yu, S. Niculescu, and J. Bi. Large scale diagnostic code classification for medical patient records. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP'08), 2008.

[21] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In Proc. AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace, 1999.

[22] S. Pakhomov, J. Buntrock, and C. Chute. Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. Journal of the American Medical Informatics Association (JAMIA), 13(5):516–525, 2006.

[23] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256, 2009.

[24] B. Ribeiro-Neto, A. Laender, and L. De Lima. An experimental study in automatically categorizing medical documents. Journal of the American Society for Information Science and Technology, 52(5):391–401, 2001.

[25] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei.
Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[26] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1973–1981. 2009.

[27] C. Wang, D. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. In CVPR, pages 1903–1910, 2009.