{"title": "Inferring Generative Model Structure with Static Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 240, "page_last": 250, "abstract": "Obtaining enough labeled data to robustly train complex discriminative models is a major bottleneck in the machine learning pipeline. A popular solution is combining multiple sources of weak supervision using generative models. The structure of these models affects the quality of the training labels, but is difficult to learn without any ground truth labels. We instead rely on weak supervision sources having some structure by virtue of being encoded programmatically. We present Coral, a paradigm that infers generative model structure by statically analyzing the code for these heuristics, thus significantly reducing the amount of data required to learn structure. We prove that Coral's sample complexity scales quasilinearly with the number of heuristics and number of relations identified, improving over the standard sample complexity, which is exponential in n for learning n-th degree relations. Empirically, Coral matches or outperforms traditional structure learning approaches by up to 3.81 F1 points. Using Coral to model dependencies instead of assuming independence results in better performance than a fully supervised model by 3.07 accuracy points when heuristics are used to label radiology data without ground truth labels.", "full_text": "Inferring Generative Model Structure\n\nwith Static Analysis\n\nParoma Varma1, Bryan He2, Payal Bajaj2,\n\nNishith Khandwala2, Imon Banerjee3, Daniel Rubin3,4, Christopher R\u00e92\n\n1Electrical Engineering, 2Computer Science, 3Biomedical Data Science, 4Radiology\n\n{paroma,bryanhe,pabajaj,nishith,imonb,rubin}@stanford.edu,\n\nchrismre@cs.stanford.edu\n\nStanford University\n\nAbstract\n\nObtaining enough labeled data to robustly train complex discriminative models is a\nmajor bottleneck in the machine learning pipeline. 
A popular solution is combining\nmultiple sources of weak supervision using generative models. The structure of these\nmodels affects the quality of the training labels, but is dif\ufb01cult to learn without any\nground truth labels. We instead rely on weak supervision sources having some struc-\nture by virtue of being encoded programmatically. We present Coral, a paradigm\nthat infers generative model structure by statically analyzing the code for these\nheuristics, thus signi\ufb01cantly reducing the amount of data required to learn structure.\nWe prove that Coral\u2019s sample complexity scales quasilinearly with the number of\nheuristics and number of relations identi\ufb01ed, improving over the standard sample\ncomplexity, which is exponential in n for learning nth degree relations. Empirically,\nCoral matches or outperforms traditional structure learning approaches by up to\n3.81 F1 points. Using Coral to model dependencies instead of assuming indepen-\ndence results in better performance than a fully supervised model by 3.07 accuracy\npoints when heuristics are used to label radiology data without ground truth labels.\n\n1\n\nIntroduction\n\nComplex discriminative models like deep neural networks rely on a large amount of labeled training\ndata for their success. For many real-world applications, obtaining this magnitude of labeled\ndata is one of the most expensive and time consuming aspects of the machine learning pipeline.\nRecently, generative models have been used to create training labels from various weak supervision\nsources, such as heuristics or knowledge bases, by modeling the true class label as a latent variable\n[1, 2, 27, 31, 36, 37]. After the necessary parameters for the generative models are learned using\nunlabeled data, the distribution over the true labels can be inferred. Properly specifying the structure\nof these generative models is essential in estimating the accuracy of the supervision sources. 
While\ntraditional structure learning approaches have focused on the supervised case [23, 28, 41], previous\nworks related to weak supervision assume that the structure is user-speci\ufb01ed [1, 27, 31, 36]. Recently,\nBach et al. [2] showed that it is possible to learn the structure of these models with a sample complexity\nthat scales sublinearly with the number of possible binary dependencies. However, the sample\ncomplexity scales exponentially for higher degree dependencies, limiting its ability to learn complex\ndependency structures. Moreover, the time required to learn the dependencies also grows exponentially\nwith the degree of dependencies, hindering the development of user-de\ufb01ned heuristics.\nThis poses a problem in many domains where high degree dependencies are common among heuristics\nthat operate over a shared set of inputs. These inputs are interpretable characteristics extracted from the\ndata. For example, various approaches in computer vision use predicted bounding box or segmentation\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fattributes [18, 19, 29], like location and size, to weakly supervise more complex image-based learning\ntasks [5, 7, 11, 26, 38]. Another example comes from the medical imaging domain, where attributes\ninclude characteristics such as the area, intensity and perimeter of a tumor, as shown in Figure 1. Note\nthat these attributes are computationally represented, and the heuristics written over them are encoded\nprogrammatically as well. There are typically a relatively small set of interpretable characteristics, so\nthe heuristics often share these attributes. 
This results in high order dependency structures among these\nsources, which are crucial to model in the generative model that learns accuracies for these sources.\nTo address the issue of learning higher order dependencies ef\ufb01ciently, we present Coral, a paradigm\nthat statically analyzes the source code of the weak supervision sources to infer, rather than learn,\nthe complex relations among heuristics. Coral\u2019s sample complexity scales quasilinearly with the\nnumber of relevant dependencies and does not scale with the degree of the dependency, unlike the\nsample complexity for Bach et al. [2], which scales exponentially with the degree of the dependency.\nMoreover, the time to identify these relations is constant in the degree of dependencies, since it only\nrequires looking at the source code for each heuristic to \ufb01nd which heuristics share the same input.\nThis allows Coral to infer high degree dependencies more ef\ufb01ciently than techniques that rely only\non statistical methods to learn them, and thus generate a more accurate dependency structure for the\nheuristics. Coral then uses a generative model to learn the proper weights for this dependency structure\nto assign probabilistic labels to training data.\nWe experimentally validate the performance of Coral across various domains and show it outperforms\ntraditional structure learning under various conditions while being signi\ufb01cantly more computationally\nef\ufb01cient. We show how modeling dependencies leads to an improvement of 3.81 F1 points compared\nto standard structure learning approaches. Additionally, we show that Coral can assign labels to data\nthat have no ground truth labels, and this augmented training set results in improving the discriminative\nmodel performance by 3.07 points. 
For a complex relation-based image classification task, 6 heuristic functions written over only bounding box attributes as primitives are able to train a model that performs within 0.74 points of the F1 score achieved by a fully-supervised model trained on the rich, hand-labeled attribute and relation information in the Visual Genome database [21].

2 The Coral Paradigm

The Coral paradigm takes as input a set of domain-specific primitives and a set of programmatic user-defined heuristic functions that operate over the primitives. We formally define these abstractions in Section 2.1. Coral runs static analysis on the source code that defines the primitives and the heuristic functions to identify which sets of heuristics are related by virtue of sharing primitives (Section 2.2). Once Coral identifies these dependencies, it uses a factor graph to model the relationship between the heuristics, primitives and the true class label. We describe the conditions under which Coral can learn the structure of the generative model with significantly less data than traditional approaches in Section 2.3 and demonstrate how this affects generative model accuracy via simulations. Finally, we discuss how Coral learns the accuracies of each heuristic and outputs probabilistic labels for the training data in Section 2.4.

Figure 1: Running example for the Coral paradigm. Users apply standard algorithms to segment tumors from the X-ray and extract the domain-specific primitives from the image and segmentation. They write heuristic functions over the primitives that output a noisy label for each image.
The generative model takes these as inputs and provides probabilistic training labels for the discriminative model.

[Figure 1, the Coral pipeline: raw images → segmentations → domain-specific primitives (p1: area, p2: perimeter, p3: intensity) → heuristic functions λ1(p1), λ2(p2), λ3(p2, p3) → noisy labels (aggressive / non-aggressive) → generative model → training labels (e.g., 75% aggressive) → discriminative model. Primitives and heuristics are user-defined.]

2.1 Coral Abstractions

Domain-Specific Primitives Domain-specific primitives (DSPs) in Coral are the simplest elements that heuristic functions take as input and operate over. DSPs in Coral have semantic meaning, making them interpretable for users. This is akin to the concept of language primitives in programming languages, in which they are the smallest unit of processing with meaning. The motivation for making the DSPs domain-specific instead of a general construct for the various data modalities is to allow users to take advantage of existing work in their field to extract meaningful characteristics from the raw data.

Figure 1 shows an example of a pipeline for bone tumor classification as aggressive or non-aggressive, inspired by one of our real experiments. First, an automated segmentation algorithm is used to generate a binary mask for where the tumor is [20, 25, 34, 39]. Then, we define 3 DSPs based on the segmentation: area (p1), perimeter (p2) and total intensity (p3) of the segmented area. More complex characteristics such as those that capture texture, shape and edge features can also be used [4, 14, 22] (see Appendix).

We now define a formal construct for how DSPs are encoded programmatically.
Users generate\nDSPs in Coral through a primitive speci\ufb01er function, such as create_primitives in Figure 2(a).\nSpeci\ufb01cally, this function takes as input a single unlabeled data point (and necessary intermediate\nrepresentations such as the segmentation) and returns an instance of PrimitiveSet, which maps\nprimitive names to primitive values, like integers (we refer to a speci\ufb01c instance of this class as P).\nNote that P.ratio is composed of two other primitives, while the rest of the primitives are generated\nindependently from the image and segmentation.\n\nFigure 2: (a) The create_primitives function that generates primitives. (b) Part of the AST for the\ncreate_primitives function. (c) The composition structure that results from traversing the AST.\n\nHeuristic Functions\nIn Coral, heuristic functions (HFs) can be viewed as mapping a subset of the\nDSPs to a noisy label for the training data, as shown in Figure 1. In our experience with user-de\ufb01ned\nHFs, we observe that HFs are usually nested if-then statements in which each statement checks whether\nthe value of a single primitive or a combination of them are above or below a user-set threshold (see\nAppendix). As shown in Figure 3(a), they take \ufb01elds of the object P as input and return a label (or\nabstain) based on the value of the input primitives. While our running example focuses on a single\ndata point for DSP generation and HF construction, both procedures are applied to the entire training\nset to assign a set of noisy labels from each HF to each data point.\n\n2.2 Static Dependency Analysis\n\nSince the number of DSPs in some domains can be relatively small, multiple HFs can operate over\nthe same DSPs. HFs that share at least one primitive are trivially related to each other. Prior work\n[2] learns these dependencies using the labels HFs assign to data points and its probability of success\nscales with the amount of data available. 
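As a concrete illustration, a runnable version of the Figure 2(a) specifier and the running example's HFs might look as follows. This is a hedged sketch: PrimitiveSet, get_area, and get_perimeter are simplified stand-ins for illustration, not Coral's actual helpers, and the thresholds follow the toy rules of Figure 3(a).

```python
import numpy as np

class PrimitiveSet:
    """Bare-bones container mapping primitive names to values via attributes."""
    pass

def get_area(mask):
    return float(np.sum(mask))                 # pixel count inside the mask

def get_perimeter(mask):
    # crude stand-in: count mask pixels with at least one background 4-neighbour
    m = np.pad(mask.astype(bool), 1)
    interior = m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:]
    return float(np.sum(m[1:-1, 1:-1] & ~interior))

def create_primitives(image, segmentation):
    P = PrimitiveSet()
    P.area = get_area(segmentation)
    P.perimeter = get_perimeter(segmentation)
    P.intensity = float(np.sum(segmentation * image))
    P.ratio = P.intensity / P.perimeter        # composed of two other primitives
    return P

# HFs map primitives to a noisy label in {-1, 0, 1}, where 0 means abstain.
def lam_1(area):      return 1 if area >= 2.0 else -1
def lam_2(perimeter): return 1 if perimeter <= 12.0 else 0
def lam_3(ratio):     return 1 if ratio <= 5.0 else -1

seg = np.array([[0, 1, 1], [0, 1, 1]])
P = create_primitives(np.full(seg.shape, 2.0), seg)
print(lam_1(P.area), lam_2(P.perimeter), lam_3(P.ratio))   # -> 1 1 1
```

In a real pipeline the segmentation would come from an automated algorithm and the primitives from established domain-specific feature extractors; only the overall shape of the encoding matters here.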
However, only pairwise HF dependencies can be learned efficiently, since the data required grows exponentially with the degree of the HF relation. This in turn limits the complexity of the dependency structure this method can accurately learn and model.

Heuristic Function Inputs Coral takes advantage of the fact that users write HFs over a known, finite set of primitives. It infers dependencies that exist among HFs by simply looking at the source code of how the DSPs and HFs are constructed. This process requires no data to successfully learn the dependencies, making it more computationally efficient than standard approaches. In order to determine whether any set of HFs share at least one DSP, Coral looks at the input for each HF. Since the HFs only take as input the DSPs they operate over, simply grouping HFs by the primitives they share is an efficient approach for recognizing these dependencies.

[Figure 2(a) code:
    def create_primitives(image, segmentation):
        P = PrimitiveSet()
        P.area = get_area(segmentation)
        P.perimeter = get_perimeter(segmentation)
        P.intensity = np.sum(segmentation * image)
        P.ratio = P.intensity / P.perimeter
        return P
(b) part of the AST for the P.ratio assignment (the "/" operator with P.intensity and P.perimeter as values, and the get_perimeter() function call); (c) resulting composition structure: P.ratio depends on P.intensity and P.perimeter.]

As shown in our running example, this would result in Coral not recognizing any dependencies among the HFs since the inputs for all 3 HFs are different (Figure 3(a)). This, however, would be incorrect, since the primitive P.ratio is composed of P.perimeter and P.intensity, which makes λ2 and λ3 related. Therefore, along with looking at the primitives that each HF takes as input, it is also essential to model how these primitives are composed.

Primitive Compositions We use our running example in Figure 2 to explain how Coral gathers information about DSP compositions.
Coral builds an abstract syntax tree (AST) to represent the\ncomputations the create_primitives function performs. An AST represents operations involving\nthe primitives as a tree, as shown in Figure 2(b). To \ufb01nd primitive compositions from the AST, Coral\n\ufb01rst \ufb01nds the expressions in the AST that add primitives to P (denoted in the AST as P.name). Then,\nfor each assignment expression, Coral traverses the subtree rooted at the assignment expression and\nadds all other encountered primitives as a dependency for P.name. If no primitives are encountered\nin the subtree, the primitive is registered as being independent of the rest. The composition structure\nthat results from traversing the AST is shown in Figure 2(c), where P.area, P.intensity, and\nP.perimeter are independent while P.ratio is a composition.\n\nHeuristic Function Dependency Structure With knowledge of how the DSPs are composed, we\nreturn to our original method of looking at the inputs of the HFs. As before, we identify that \u03bb1 and\n\u03bb2 use P.area and P.perimeter, respectively. However, we now know that \u03bb3 uses P.ratio, which\nis a composition of P.intensity and P.perimeter. This implies that \u03bb3 will be related to any HF\nthat takes either P.intensity, P.perimeter, or both as inputs. We proceed to build a relational\nstructure among the HFs and DSPs. As shown in Figure 3(b), this structure shows which independent\nDSPs each HF operates over. The relational structure implicitly encodes dependency information\nabout the HFs \u2014 if an edge points from one primitive to n HFs, those n HFs are in an n-way relation\nby virtue of sharing that primitive. This dependency information can more formally be encoded in\na factor graph shown in Figure 3(c), which is discussed in Section 2.3. 
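The composition pass and the input-grouping step just described can be sketched with Python's ast and inspect modules. This is a simplified, assumption-laden sketch rather than Coral's implementation: it assumes single-target assignments of the form P.name = ..., HF parameters named after the primitives they consume, and one level of composition.

```python
import ast
import inspect
import textwrap
from collections import defaultdict

SOURCE = textwrap.dedent("""
    def create_primitives(image, segmentation):
        P = PrimitiveSet()
        P.area = get_area(segmentation)
        P.perimeter = get_perimeter(segmentation)
        P.intensity = np.sum(segmentation * image)
        P.ratio = P.intensity / P.perimeter
        return P
""")

def primitive_compositions(src, obj="P"):
    """Map each primitive to the set of primitives read in its assignment's RHS."""
    deps = {}
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            t = node.targets[0]
            if (isinstance(t, ast.Attribute)
                    and isinstance(t.value, ast.Name) and t.value.id == obj):
                # traverse the subtree rooted at the RHS, collecting P.<attr> reads
                deps[t.attr] = {n.attr for n in ast.walk(node.value)
                                if isinstance(n, ast.Attribute)
                                and isinstance(n.value, ast.Name)
                                and n.value.id == obj}
    return deps

def hf_dependencies(hfs, compositions):
    """Group HFs by the independent primitives they (transitively) read."""
    groups = defaultdict(set)
    for hf in hfs:
        for prim in inspect.signature(hf).parameters:
            for base in compositions.get(prim) or {prim}:   # expand compositions
                groups[base].add(hf.__name__)
    # a primitive shared by >= 2 HFs induces a dependency among those HFs
    return {p: sorted(names) for p, names in groups.items() if len(names) > 1}

def lam_1(area):      return 1 if area >= 2.0 else -1
def lam_2(perimeter): return 1 if perimeter <= 12.0 else 0
def lam_3(ratio):     return 1 if ratio <= 5.0 else -1

comps = primitive_compositions(SOURCE)
print(hf_dependencies([lam_1, lam_2, lam_3], comps))
# -> {'perimeter': ['lam_2', 'lam_3']}
```

As in the running example, grouping on declared inputs alone finds nothing, but expanding ratio into intensity and perimeter reveals that λ2 and λ3 share P.perimeter.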
Note that we chose a particular\nprogrammatic setup for creating DSPs and HFs to explain how static analysis can infer dependencies;\nhowever, this process can be modi\ufb01ed to work with other setups that encode DSPs and HFs as well.\n\nFigure 3: (a) shows the encoded HFs. (b) shows the HF dependency structure where DSP nodes have\nan edge going to the HFs that use them as inputs (explicitly or implicitly). (c) shows the factor graph\nCoral uses to model the relationship between HFs, DSPs, and latent class label Y.\n\n2.3 Creating the Generative Model\n\nWe now describe the generative model used to predict the true class labels. The Coral model uses a\nfactor graph (Figure 3(c)) to model the relationship between the primitives (p\u2208R), heuristic functions\n(\u03bb\u2208{\u22121,0,1}) and latent class label (Y \u2208{\u22121,1}). We show that by incorporating information about\nhow primitives are shared across HFs from static analysis, this factor graph infers all dependencies\nbetween the heuristics that are guaranteed to be present. 
We also describe how Coral recovers additional dependencies among the heuristics by studying empirical relationships between the primitives.

Modeling Heuristic Function Dependencies Now that dependencies have been inferred via static analysis, the goal is to learn the accuracies for each HF and assign labels to training data accordingly.

[Figure 3(a) code:
    def λ_1(P.area):
        if P.area >= 2.0: return 1
        else: return -1
    def λ_2(P.perimeter):
        if P.perimeter <= 12.0: return 1
        else: return 0
    def λ_3(P.ratio):
        if P.ratio <= 5.0: return 1
        else: return -1
(b) dependency structure: P.area → λ1; P.perimeter → λ2, λ3; P.intensity → λ3; (c) factor graph over λ1, λ2, λ3, p1, p2, p3, and Y, with factors φHF, φAcc, and φDSP.]

The factor graph thus consists of two types of factors: accuracy factors φAcc and HF factors from static analysis φHF.

The accuracy factors specify the accuracy of each heuristic function and are defined as

    φAcc_i(Y, λi) = Y λi,  i = 1, ..., n

where n is the total number of heuristic functions.

The static analysis factors ensure that the heuristics are correctly evaluated based on the HF dependencies found via static analysis. They ensure that a probability of zero is given to any configuration where an HF does not have the correct value given the primitives it depends on.
The static analysis factors are defined as

    φHF_i(λi, p1, ..., pm) = 0 if λi is valid given p1, ..., pm, and −∞ otherwise,  i = 1, ..., n.

Since these factors are obtained directly from static analysis, they can be recovered with no data.

However, we note that static analysis is not sufficient to capture all dependencies required in the factor graph to accurately model the process of generating training labels. Specifically, static analysis can

(i) pick up spurious dependencies among HFs that are not truly dependent on each other, or
(ii) miss key dependencies among HFs that exist due to dependencies among the DSPs in the HFs.

(i) can occur if some λA takes as input DSPs pi, pj and λB takes as input DSPs pi, pk, but pi always has the same value. Although static analysis would pick up that λA and λB share a primitive and should have a dependency, this may not be true if pj and pk are independent. (ii) can occur if two HFs depend on different primitives, but these primitives happen to always have the same value. In this case, it is impossible for static analysis to infer the dependency between the HFs if the primitives have different names and are generated independently, as described in Section 2.2. A more realistic scenario comes from our running example, where we would expect the area and perimeter of the tumor to be related.

To account for both cases, it is necessary to capture the possible dependencies that occur among the DSPs to ensure that the dependencies from static analysis do not misspecify the factor graph. We introduce a factor to account for additional dependencies among the primitives, φDSP. There are many possible choices for this dependency factor, but one simple choice is to model pairwise similarity between the primitives.
For binary and discrete primitives, the dependency factor with pairwise similarity can be represented as

    φDSP(p1, ..., pm) = Σ_{i<j} φSim_ij(pi, pj),  where φSim_ij(pi, pj) = I[pi = pj].

The dependency factor can be generalized to continuous-valued primitives by binning the primitives into discrete values before comparing for similarity.

Finally, with three types of factors, the probability distribution specified by the factor graph is

    P(y, λ1, ..., λn, p1, ..., pm) ∝ exp( Σ_{i=1}^{n} θAcc_i φAcc_i + Σ_{i=1}^{n} φHF_i + Σ_{i=1}^{m} Σ_{j=i+1}^{m} θSim_ij φSim_ij )

where θAcc_i and θSim_ij are weights that specify the strength of factors φAcc_i and φSim_ij.

Inferring Dependencies without Data The HF factors capture all dependencies among the heuristic functions that are not represented by the φDSP factor. The dependencies represented by the φDSP factor are precisely the dependencies that cannot be inferred via static analysis due to the fact that this factor depends solely on the content of the primitives. It is therefore impossible to determine what this factor is without data.

While assuming that we have the true φDSP seems like a strong condition, we find that in real-world experiments, including the φDSP factor rarely leads to improvements over the case when we only include the φAcc and φHF factors. In some of our experiments (see Section 3), we use bounding box location, size and object labels as domain-specific primitives for image and video querying tasks. Since these primitives are not correlated, modeling the primitive dependency does not lead to any improvement over just modeling HF dependencies from static analysis. Moreover, in other experiments where modeling the relation among primitives helps, we observe relatively small benefits above what modeling HF dependencies provides (Section 3).
Therefore, even without data, it is possible to model the most important dependencies among HFs that lead to significant gains over the case in which no dependencies are modeled.

2.4 Generating Probabilistic Training Labels

Given the probability distribution of the factor graph, our goal is to learn the proper weights θAcc_i and θSim_ij. Coral adopts structure learning approaches described in recent work [2], which learns dependency structures in the weak supervision setting and maximizes the ℓ1-regularized marginal pseudolikelihood of each primitive to learn the weights of the relevant factors.

Figure 4: Simulation demonstrating improved generative model accuracy with Coral compared to structure learning [2]. Relative improvement of Coral over structure learning is plotted against number of unlabeled data points (N) and number of HFs (n).

To learn the weights of the generative model, we use contrastive divergence [15] as a maximum likelihood estimation routine and maximize the marginal likelihood of the observed primitives. Gibbs sampling is used to estimate the intractable gradients, which are then used in stochastic gradient descent. Because the HFs are typically deterministic functions of the primitives (represented as the −∞ value of the correctness factors for invalid heuristic values), standard Gibbs sampling will not be able to mix properly. As a result, we modify the Gibbs sampler to simultaneously sample one primitive along with all heuristics that depend on it. Despite the fact that the true class label is latent, this process still converges to the correct parameter values [27]. Additionally, the amount of data necessary to learn the parameters scales quasilinearly with the number of parameters.
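The modified Gibbs step described above can be illustrated as follows. This is a toy sketch with made-up names, a single binned primitive, and an ad hoc potential, not the actual sampler; in Coral the potential would come from the full factor graph.

```python
# Sketch: because HFs are deterministic in the primitives, resample one binned
# primitive jointly with every HF that reads it, scoring each candidate bin
# with the model's unnormalized log-potential.
import math
import random

def resample_primitive(k, state, bins, hfs, log_potential, rng):
    """Draw primitive k (plus dependent HF values) from its full conditional."""
    candidates, weights = [], []
    for v in bins:
        p = list(state["p"])
        p[k] = v
        lams = [hf(p) for hf in hfs]        # deterministic re-evaluation of HFs
        candidates.append((p, lams))
        weights.append(math.exp(log_potential(state["y"], lams, p)))
    r = rng.random() * sum(weights)
    for (p, lams), w in zip(candidates, weights):
        r -= w
        if r <= 0:
            break
    state["p"], state["lam"] = p, lams
    return state

# toy model: one primitive, one HF; potential rewards lambda/y agreement
hf = lambda p: 1 if p[0] >= 1 else -1
pot = lambda y, lams, p: 0.5 * y * lams[0]

rng = random.Random(0)
state = {"y": 1, "p": [0], "lam": [hf([0])]}
for _ in range(200):
    state = resample_primitive(0, state, bins=[0, 1], hfs=[hf],
                               log_potential=pot, rng=rng)
    # the HF value is always consistent with the primitive, so the chain mixes
    assert state["lam"][0] == hf(state["p"])
```

Sampling the primitive and its dependent HFs as one block avoids the zero-probability configurations that would trap a naive coordinate-wise sampler.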
In our case, the number of parameters is simply the number of heuristics n and the number of relevant primitive similarity dependencies s.

We now formally state the conditions for this result, which match those of Ratner et al. [27], and give the sample complexity of our method. First, we assume that there exists some feasible parameter set Θ ⊂ Rn that is known to contain the parameter θ* = (θAcc, θSim) that models the true distribution π* of the data:

    ∃ θ* ∈ Θ s.t. π*(p1, ..., pm, Y) = µθ*(p1, ..., pm, Y) for all (p1, ..., pm, Y).    (1)

Next, we must be able to accurately learn θ* if we are provided with labeled samples of the true distribution. Specifically, there must be an asymptotically unbiased estimator θ̂ that takes some set of labeled data T independently sampled from π* such that for some c > 0,

    Cov(θ̂(T)) ⪯ (2c|T|)^{-1} I.    (2)

Finally, we must have enough sufficiently accurate heuristics so that we have a reasonable estimate of Y. For any two feasible models θ1, θ2 ∈ Θ,

    E_{(p1,...,pm,Y) ∼ µθ1}[ Var_{(p'1,...,p'm,Y') ∼ µθ2}(Y' | p1 = p'1, ..., pm = p'm) ] ≤ c/(n+s).    (3)

Proposition 1. Suppose that we run stochastic gradient descent to produce estimates of the weights θ̂ = (θ̂Acc, θ̂Sim) in a setup satisfying conditions (1), (2), and (3). Then, for any fixed error ε > 0, if the number of unlabeled data points N is at least Ω[(n+s) log(n+s)], then our expected parameter error is bounded by E[‖θ̂ − θ*‖²] ≤ ε².

The proof follows from the sample complexity of Ratner et al. [27] and appears in the Appendix.
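To make the label-generation step concrete, the following toy sketch evaluates the unnormalized log-potential of Section 2.3's factor graph and the posterior over Y implied by the accuracy factors. The weights, thresholds, and binned primitive values here are illustrative assumptions, not learned values.

```python
import math

# deterministic HFs over binned primitives p = (p1, p2, p3); thresholds made up
def hf1(p): return 1 if p[0] >= 2 else -1
def hf2(p): return 1 if p[1] <= 12 else -1
def hf3(p): return 1 if p[2] <= 5 else -1
HFS = [hf1, hf2, hf3]

def log_potential(y, lams, p, theta_acc, theta_sim):
    # phi_HF: -inf unless each lambda_i matches its deterministic value
    if any(l != hf(p) for l, hf in zip(lams, HFS)):
        return -math.inf
    # phi_Acc: sum_i theta_acc_i * y * lambda_i
    score = sum(t * y * l for t, l in zip(theta_acc, lams))
    # phi_Sim: pairwise indicator similarity between binned primitives
    m = len(p)
    score += sum(theta_sim[i][j] for i in range(m)
                 for j in range(i + 1, m) if p[i] == p[j])
    return score

theta_acc = [0.8, 0.5, 1.0]                               # assumed, not learned
theta_sim = [[0, 0.3, 0.0], [0, 0, 0.2], [0, 0, 0]]

p = (3, 10, 4)
lams = [hf(p) for hf in HFS]                              # -> [1, 1, 1]
scores = {y: log_potential(y, lams, p, theta_acc, theta_sim) for y in (+1, -1)}
post_pos = math.exp(scores[1]) / (math.exp(scores[1]) + math.exp(scores[-1]))
print(post_pos)   # posterior P(Y = 1 | lambdas, p), about 0.99 here
```

The posterior over Y computed this way is exactly the "distribution for the latent class label" used as a probabilistic training label downstream.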
With the weights θ̂Acc_i and θ̂Sim_ij maximizing the marginal likelihood of the observed primitives, we have a fully specified factor graph and complete generative model, which can be used to predict the latent class label. For each data point, we compute the label each heuristic function applies to it using the values of the domain-specific primitives. Through the accuracy factors, we then estimate a distribution for the latent class label and use these noisy labels to train a discriminative model.

We present a simulation to empirically compare our sample complexity with that of structure learning [2]. In our simulation, we have n HFs, each with an accuracy of 75%, and explore settings in which there exists one binary, 3-ary and 4-ary dependency among the HFs. The dependent HFs share exactly one primitive, and the primitives themselves are independent (s = 0). We show our results in Figure 4. In the case with a binary dependency, structure learning recovers the necessary dependency with few samples, and has similar performance to Coral. In contrast, in the second and third settings with high-order dependencies, structure learning struggles to recover the relevant dependency, and performs worse than Coral even as more training data is provided.

[Figure 4 plots: relative improvement (%) of Coral over structure learning vs. number of unlabeled data points N, for binary, 3-ary, and 4-ary dependencies, with n = 6, 8, 10.]

3 Experimental Results

We seek to experimentally validate the following claims about our approach. Our first claim is that a model using HF dependencies inferred via static analysis performs significantly better than a model that does not take dependencies into account. Second, we compare to a structure learning approach for weak supervision [2] and show how we outperform it over a variety of domains.
Finally, we show that in cases where primitive dependencies exist, Coral can learn and model them as well. We show that modeling the dependencies between the heuristic functions and primitives can generate training sets that, in some cases, beat fully supervised models by labeling additional unlabeled data. Our classification tasks range from specialized medical domains to natural images and video, and we include details of the DSPs and HFs in the Appendix. Note that while the number of HFs and DSPs is fairly low (Table 1), using static analysis to automatically infer dependencies rather than asking users to identify them saves significant effort, since the number of possible dependencies grows exponentially with the number of HFs present.

We compare our approach to majority vote (MV), generative models that learn the accuracies of different heuristics, specifically one that assumes the heuristics are independent (Indep) [27], and Bach et al. [2], which learns the binary inter-heuristic dependencies (Learn Dep). We also compare to the fully supervised (FS) case, and measure the performance of the discriminative model trained with labels generated using the above methods. We split our approach into two parts: inferring HF dependencies using only static analysis (HF Dep) and additionally learning primitive-level dependencies (HF+DSP Dep).

Figure 5: Discriminative model performance comparing HF Dep (HF dependencies from static analysis) and HF+DSP Dep (HF and DSP dependencies) to other methods. Numbers in Appendix.

Visual Genome and ActivityNet Classification We explore how to extract complex relations in images and videos given object labels and their bounding boxes. We used subsets of two datasets, Visual Genome [21] and ActivityNet [9], and defined our task as finding images of "a person biking down a road" and finding basketball videos, respectively.
For both tasks, a small set of DSPs were shared heavily among HFs, and modeling the dependencies observed by static analysis led to a significant improvement over the independent case. Since these dependencies involved groups of 3 or more heuristics, Coral improved significantly over structure learning as well, which was unable to model these dependencies due to the lack of enough data. Moreover, modeling primitive dependencies did not help since the primitives were indeed independent (Table 1). We report our results for these tasks in terms of the F1 score (harmonic mean of the precision and recall) since there was significant class imbalance which accuracy would not capture well.

Table 1: Heuristic Function (HF) and Domain-Specific Primitive (DSP) statistics. Discriminative model improvement with HF+DSP Dep over other methods (last four columns). *improvements shown in terms of F1 score, rest in terms of accuracy. ActivityNet model is LR using VGGNet embeddings as features.

    Application     DSPs  HFs  Shared DSPs  Model       MV     Indep  Learn Dep  FS
    Visual Genome   5     7    2            GoogLeNet   7.49*  2.90*  2.90*      -0.74*
    ActivityNet     4     5    2            VGGNet+LR   6.23*  3.81*  3.81*      -1.87*
    Bone Tumor      7     17   0            LR          5.17   3.57   3.06       3.07
    Mammogram       6     6    0            GoogLeNet   4.62   1.11   0          -0.64

Bone Tumor Classification We used a set of 802 labeled bone tumor X-ray images along with their radiologist-drawn segmentations. Our task was to differentiate between aggressive and non-aggressive tumors. We generated HFs that were a combination of hand-tuned rules and decision-tree generated rules (tuned on a small held-out subset of the dataset). The discriminative model utilized a set of 400 hand-tuned features (note that there is no overlap between these features and the DSPs) that encoded various shape, texture, edge and intensity-based characteristics.
Although there were no explicitly\nshared primitives in this dataset, the generative model was still able to model the training labels more\naccurately with knowledge of how heuristics used primitives, which affects the relative false positive\nand false negative rates. Thus, the generative model signi\ufb01cantly improved over the independent\nmodel. Moreover, a small dataset size hindered structure learning, which gave a minimal boost over\nthe independent case (Table 1). When we used heuristics in Coral to label an additional 800 images\nthat had no ground truth labels, we beat the previous FS score by 3.07 points (Figure 5, Table 1).\n\nMammogram Tumor Classi\ufb01cation We used the DDSM-CBIS [32] dataset, which consists of\n1800 scanned \ufb01lm mammograms and associated segmentations for the tumors in the form of binary\nmasks. Our task was to identify whether a tumor is malignant or benign, and each heuristic only\noperated over one primitive, resulting in no dependencies that static analysis could identify. In\nthis case, structure learning performed better than Coral when we only used static analysis to infer\ndependencies (Figure 5). However, including primitive dependencies allowed us to match structure\nlearning, resulting in a 1.11 point improvement over the independent case (Figure 5, Table 1).\n\n4 Related Work\n\nAs the need for labeled training data grows, a common alternative is to utilize weak supervision sources\nsuch as distant supervision [10, 24], multi-instance learning [16, 30], and heuristics [8, 35]. Speci\ufb01cally\nfor images, weak supervision using object detection and segmentation or visual databases is a popular\ntechnique as well (detailed discussion in Appendix). Estimating the accuracies of these sources without\naccess to ground truth labels is a classic problem [13]. 
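The classic accuracy-estimation problem mentioned above can be illustrated with a simplified one-coin, Dawid–Skene-style EM iteration [13]. The sketch below is a toy version with synthetic votes and per-source symmetric accuracies; it is not Coral's generative model, and all names and data are hypothetical.

```python
import random

random.seed(0)

# Hypothetical setup: 3 weak supervision sources of unknown accuracy
# vote (+1 / -1) on 200 items with hidden true labels.
true_acc = [0.9, 0.7, 0.6]
truth = [random.choice([1, -1]) for _ in range(200)]
votes = [[y if random.random() < a else -y for a in true_acc] for y in truth]

# One-coin Dawid-Skene-style EM: alternate between soft estimates of the
# latent labels and per-source accuracy estimates, using no ground truth.
acc = [0.8, 0.8, 0.8]  # initial guess, above chance to fix the label mode
for _ in range(20):
    # E-step: posterior probability each item's label is +1,
    # assuming a uniform prior over the two classes.
    post = []
    for v in votes:
        p_pos = p_neg = 1.0
        for vi, a in zip(v, acc):
            p_pos *= a if vi == 1 else 1 - a
            p_neg *= a if vi == -1 else 1 - a
        post.append(p_pos / (p_pos + p_neg))
    # M-step: re-estimate each source's accuracy against the soft labels.
    acc = [
        sum(p if v[j] == 1 else 1 - p for v, p in zip(votes, post)) / len(votes)
        for j in range(3)
    ]

# Estimated accuracies; with enough data these approach true_acc.
print([round(a, 2) for a in acc])
```

Methods in this family recover source accuracies from agreement patterns alone; Coral differs in that it additionally models the dependency structure among the sources, read off from their code.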
Methods such as crowdsourcing [12, 17, 40], boosting [3, 33], co-training [6], and learning from noisy labels are popular approaches that can combine various sources of weak supervision to assign noisy labels to data. However, Coral does not require any labeled data to model the dependencies among the heuristics (which can be interpreted as workers, classifiers, or views for the above methods) and the domain-specific primitives.
Recently, generative models have also been used to combine various sources of weak supervision [1, 31, 36, 37]. One specific example, data programming [27], proposes using multiple sources of weak supervision for text data to describe a generative model and subsequently learns the accuracies of these sources. Coral also focuses on multiple programmatically encoded heuristics that can weakly label data and learns their accuracies to assign labels to training data. However, Coral adds an additional layer of domain-specific primitives in its generative model, which allows it to generalize beyond text-based heuristics. It also infers the dependencies among the heuristics and the primitives, rather than requiring users to specify them.
Other previous work also assumes that this structure in generative models is user-specified [1, 31, 36]. However, Bach et al. [2] recently showed that it is possible to learn the dependency structure among sources of weak supervision with a sample complexity that scales sublinearly with the number of possible pairwise dependencies. Coral instead identifies the dependencies among the heuristic functions by inspecting the content of the programmable functions, thereby relying on significantly less data to learn the generative model structure. Moreover, Coral can also pick up higher-order dependencies, which Bach et al.
[2] would need large amounts of data to detect.

5 Conclusion and Future Work

In this paper, we introduced Coral, a paradigm that models the dependency structure of weak supervision heuristics and systematically combines their outputs to assign probabilistic labels to training data. We described how Coral takes advantage of the programmatic nature of these heuristics to infer dependencies among them via static analysis. Coral therefore has a sample complexity that is quasilinear in the number of heuristics and relations found. We showed how Coral leads to significant improvements in discriminative model accuracy over traditional structure learning approaches across various domains. Coral only scratches the surface of the ways weak supervision can borrow from the field of programming languages, especially as weak supervision sources are used to label large amounts of data and need to be encoded programmatically. Treating the encoding of heuristics as writing functions is a natural extension of this view, and we hope to explore the interactions between systematic training set creation and concepts from the programming languages field.

Acknowledgments We thank Shoumik Palkar, Stephen Bach, and Sen Wu for their helpful conversations and feedback. We are grateful to Darvin Yi for his assistance with the DDSM dataset-based experiments and associated deep learning models. We acknowledge the use of the bone tumor dataset annotated by Drs. Christopher Beaulieu and Bao Do and carefully collected over his career by the late Henry H. Jones, M.D. (aka "Bones Jones"). This material is based on research sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement number FA8750-17-2-0095. We gratefully acknowledge the support of the DARPA SIMPLEX program under No.
N66001-15-C-4043, DARPA FA8750-12-2-0335 and FA8750-13-2-0039, DOE 108845, the National Science Foundation (NSF) Graduate Research Fellowship under No. DGE-114747, Joseph W. and Hon Mai Goodman Stanford Graduate Fellowship, National Institute of Health (NIH) U54EB020405, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, the Moore Foundation, the Okawa Research Grant, American Family Insurance, Accenture, Toshiba, and Intel. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project: Intel, Microsoft, Teradata, and VMware. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, NSF, NIH, ONR, or the U.S. government.

References

[1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 54-59. Association for Computational Linguistics, 2012.

[2] S. H. Bach, B. He, A. Ratner, and C. Ré. Learning the structure of generative models without labeled data. In ICML, 2017.

[3] A. Balsubramani and Y. Freund. Scalable semi-supervised aggregation of classifiers. In Advances in Neural Information Processing Systems, pages 1351-1359, 2015.

[4] I. Banerjee, L. Hahn, G. Sonn, R. Fan, and D. L. Rubin.
Computerized multiparametric MR image analysis for prostate cancer aggressiveness-assessment. arXiv preprint arXiv:1612.00408, 2016.

[5] M. Blaschko, A. Vedaldi, and A. Zisserman. Simultaneous object detection and ranking with weak supervision. In Advances in Neural Information Processing Systems, pages 235-243, 2010.

[6] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92-100. ACM, 1998.

[7] S. Branson, P. Perona, and S. Belongie. Strong supervision from weak annotation: Interactive training of deformable part models. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1832-1839. IEEE, 2011.

[8] R. Bunescu and R. Mooney. Learning to extract relations from the web using minimal supervision. In ACL, 2007.

[9] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961-970, 2015.

[10] M. Craven, J. Kumlien, et al. Constructing biological knowledge bases by extracting information from text sources. In ISMB, pages 77-86, 1999.

[11] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635-1643, 2015.

[12] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd International Conference on World Wide Web, pages 285-294. ACM, 2013.

[13] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20-28, 1979.

[14] R. M. Haralick, K. Shanmugam, et al.
Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 3(6):610-621, 1973.

[15] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

[16] R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 541-550. Association for Computational Linguistics, 2011.

[17] M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Comprehensive and reliable crowd assessment algorithms. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 195-206. IEEE, 2015.

[18] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. Optimizing deep CNN-based queries over video streams at scale. CoRR, abs/1703.02529, 2017. URL http://arxiv.org/abs/1703.02529.

[19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.

[20] M. R. Kaus, S. K. Warfield, A. Nabavi, P. M. Black, F. A. Jolesz, and R. Kikinis. Automated segmentation of MR images of brain tumors. Radiology, 218(2):586-591, 2001.

[21] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.

[22] C. Kurtz, A. Depeursinge, S. Napel, C. F. Beaulieu, and D. L. Rubin. On combining image-based and ontological semantic dissimilarities for medical image retrieval applications. Medical Image Analysis, 18(7):1082-1100, 2014.

[23] N. Meinshausen and P.
B\u00fchlmann. High-dimensional graphs and variable selection with the lasso. The\n\nannals of statistics, pages 1436\u20131462, 2006.\n\n[24] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data.\nIn Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International\nJoint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003\u20131011.\nAssociation for Computational Linguistics, 2009.\n\n[25] A. Oliver, J. Freixenet, J. Marti, E. P\u00e9rez, J. Pont, E. R. Denton, and R. Zwiggelaar. A review of automatic\nmass detection and segmentation in mammographic images. Medical image analysis, 14(2):87\u2013110, 2010.\n[26] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? - Weakly-supervised learning\nwith convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 685\u2013694, 2015.\n\n[27] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. R\u00e9. Data programming: Creating large training sets,\n\nquickly. In Advances in Neural Information Processing Systems, pages 3567\u20133575, 2016.\n\n[28] P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al. High-dimensional ising model selection using\n\nl1-regularized logistic regression. The Annals of Statistics, 38(3):1287\u20131319, 2010.\n\n[29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Uni\ufb01ed, real-time object detection.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779\u2013788, 2016.\n[30] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In Joint\nEuropean Conference on Machine Learning and Knowledge Discovery in Databases, pages 148\u2013163.\nSpringer, 2010.\n\n[31] B. Roth and D. Klakow. 
Combining generative and discriminative model scores for distant supervision. In EMNLP, pages 24-29, 2013.

[32] R. Sawyer-Lee, F. Gimenez, A. Hoogi, and D. Rubin. Curated breast imaging subset of DDSM, 2016.

[33] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT Press, 2012.

[34] N. Sharma, L. M. Aggarwal, et al. Automated medical image segmentation techniques. Journal of Medical Physics, 35(1):3, 2010.

[35] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. Proceedings of the VLDB Endowment, 8(11):1310-1321, 2015.

[36] S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 721-729. Association for Computational Linguistics, 2012.

[37] P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. arXiv preprint arXiv:1610.08123, 2017.

[38] W. Xia, C. Domokos, J. Dong, L.-F. Cheong, and S. Yan. Semantic segmentation without annotating segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2176-2183, 2013.

[39] D. Yi, M. Zhou, Z. Chen, and O. Gevaert. 3-D convolutional neural networks for glioblastoma segmentation. arXiv preprint arXiv:1611.04534, 2016.

[40] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Journal of Machine Learning Research, 17(102):1-44, 2016.

[41] P. Zhao and B. Yu. On model selection consistency of lasso.
Journal of Machine Learning Research, 7(Nov):2541-2563, 2006.