{"title": "Learning Pipelines with Limited Data and Domain Knowledge: A Study in Parsing Physics Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 140, "page_last": 151, "abstract": "As machine learning becomes more widely used in practice, we need new methods to build complex intelligent systems that integrate learning with existing software, and with domain knowledge encoded as rules. As a case study, we present such a system that learns to parse Newtonian physics problems in textbooks. This system, Nuts&Bolts, learns a pipeline process that incorporates existing code, pre-learned machine learning models, and human engineered rules.  It jointly trains the entire pipeline to prevent propagation of errors, using a combination of labelled and unlabelled data.  Our approach achieves a good performance on the parsing task, outperforming the simple pipeline and its variants. Finally, we also show how Nuts&Bolts can be used to achieve improvements on a relation extraction task and on the end task of answering Newtonian physics problems.", "full_text": "Learning Pipelines with Limited Data and Domain\nKnowledge: A Study in Parsing Physics Problems\n\n\u2663Machine Learning Department, School of Computer Science, Carnegie Mellon University\n\n\u2660Department of Computer and Information Science, University of Pennsylvania\n\n\u2666Petuum Inc.\n\nMrinmaya Sachan\u2663 Avinava Dubey\u2663\n\nTom Mitchell\u2663 Dan Roth\u2660\n\nEric P. Xing\u2663\u2666\n\n{mrinmays,akdubey,tom.mitchell,epxing}@cs.cmu.edu\n\ndanroth@seas.upenn.edu\n\nAbstract\n\nAs machine learning becomes more widely used in practice, we need new methods\nto build complex intelligent systems that integrate learning with existing software,\nand with domain knowledge encoded as rules. As a case study, we present such\na system that learns to parse Newtonian physics problems in textbooks. 
This\nsystem, Nuts&Bolts, learns a pipeline process that incorporates existing code,\npre-learned machine learning models, and human engineered rules. It jointly trains\nthe entire pipeline to prevent propagation of errors, using a combination of labelled\nand unlabelled data. Our approach achieves a good performance on the parsing\ntask, outperforming the simple pipeline and its variants. Finally, we also show how\nNuts&Bolts can be used to achieve improvements on a relation extraction task\nand on the end task of answering Newtonian physics problems.\n\n1 Introduction\nEnd-to-end learning is the new buzzword in machine learning. Models trained in an end-to-end\nmanner have achieved state-of-the-art (SOTA) performance on various tasks like image classification,\nmachine translation and speech recognition. However, a common barrier to using end-to-end learning\nis the amount of data needed to train the model. For reference, the SOTA image classification model,\nVGGNet [43], is trained on 1.2M images with category labels and the SOTA machine translation\nmodel, GNMT [52], is trained on a dataset of 6M sentence pairs, 340M words. One possible remedy\nto the issue of data-hungriness is to incorporate domain knowledge. However, due to the very nature\nof the methods used, incorporating domain knowledge in end-to-end learning is challenging [47].\nIn contrast, pipelines [51] decompose a complex task into a series of easier-to-handle sub-tasks\n(stages), where the local predictor at a particular stage depends on predictions from previous stages.\nPipelines can be tuned with a small amount of labeled data and it is easier to incorporate domain\nknowledge expressed as rules, existing software and pre-learnt components. However, pipelining\nsuffers from propagation of local errors [11].\nThus, we propose Nuts&Bolts: an approach for learning pipelines with labeled data, unlabeled\ndata, existing software and domain knowledge expressed as rules. 
By jointly learning the pipeline,\nNuts&Bolts retains the advantages of end-to-end learning (i.e. doesn\u2019t suffer from error propagation). Furthermore, it allows for easy incorporation of domain knowledge and reduces the amount of\nsupervision required, removing the two key shortcomings of end-to-end learning.\n\nFigure 1: Above: An example Newtonian physics problem. Below: Diagram parsed in formal language.\nProblem: The figure shows three forces applied to a trunk that moves leftward by 3.00 m over a frictionless floor. The force magnitudes are F1 = 5.00 N, F2 = 9.00 N, and F3 = 3.00 N, and the indicated angle is \u03b8 = 60.0\u00b0. During the displacement, what is the net work done on the trunk by the three forces?\nParse: Objects: {block, floor}; Relative Position: lie-above(block, floor); Forces acting on block: {F1, F2, F3}; Forces acting on floor: {}; Force Directions: {F1: left, F2: right \u03b8 above horizontal, F3: down}\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe are motivated by the novel task of parsing Newtonian physics problems into formal language\n(see Figure 1). This is useful as it builds a computer-ingestible, rich semantic representation of\nthese problems. It is also a key step in a large line of concurrent research on building a solver\nfor such problems [41, 17, 38, 37, 39]. The problems typically consist of a paragraph of text and\n(often) an associated diagram. These problems are quite diverse, representing complex physical and\nmathematical concepts which are not often present in natural text and images1 \u2013 understanding them\nitself requires substantial domain knowledge. Hence, traditional NLP and Computer Vision methods\ncannot be directly ported to extract and represent the semantics of these problems, and a richer\nintegration of domain knowledge is needed. 
We model Nuts&Bolts using logic programming\nand show improvements over multiple baselines as well as various pipeline approaches, achieving\nstate-of-the-art results.\nOur paper is structured as follows: we \ufb01rst introduce the task of parsing Newtonian physics problems\nin section 2 and use it to motivate our approach in section 3. Then, we describe the parsing pipeline\nin section 4, and our experiments in section 5. We review related work in section 6. Then, \ufb01nally we\nwill showcase the bene\ufb01t of our approach on the problem of relation extraction from text in section 7\nwhere we achieve state-of-the-art results outperforming Snorkel, a strong relation extraction system.\n2 Motivation\nProblem De\ufb01nition: To motivate our work, we \ufb01rst de\ufb01ne the task of parsing Newtonian physics\nproblems as one of mapping the problem (text and diagram) to a formal logical language. We choose\nour formal language as a subset of typed \ufb01rst-order logic comprising of constants (3.00 m, 5.00\nN, 60\u25e6, etc.), variables (F1, \u03b8, etc.), and a hand-picked set of 138 predicates (equals, mass,\ndistance, force, speed, velocity, work, etc.). Each element in the language is strongly\ntyped. For example, the predicate mass takes the \ufb01rst argument of the type \u201cobject\u201d and the second\nargument of the type \u201cmass-quantity\u201d such as 2kg, 3g, etc. As shown in Figure 1, the parse is\nrepresented as logical formulas, i.e. conjunctions over applications of (possibly negated) predicates\nwith arguments which can be constants or variables.\nThe primary motivation behind modelling this task as a\npipeline is the dif\ufb01culty of solving such a challenging\nproblem with a single monolithic learner; that express-\ning this problem directly in terms of input and output\nwill result in a complex function that may be impossible\nto learn. 
At the same time, there is a high cost associated with obtaining sufficient labeled data to achieve\ngood learning performance, which is difficult in such niche domains. However, the pipeline allows us to think\nof the task in a modular way, and to integrate stagewise supervision and domain knowledge of physics into the\nmodel. It also allows us to supervise the various sub-components to aid rapid learning.\nWe begin by describing the parsing pipeline at a high level. We break the parsing task into three phases. In\nthe first phase, we parse the diagram, recognizing the various diagram elements and relationships between\nthem, leading to a diagram parse in formal language. In the second phase, we parse the problem text into the\nsame formal language. In the third and final phase, we reconcile the diagram and the text parse and achieve\nthe final parse of the problem2. Diagram parsing is performed in the following stages: (a) We first identify\nlow-level diagram elements such as lines and arcs, corners (i.e. intersecting points\nof various low-level diagram elements), objects (e.g. block in Figure 1) and text elements,\ni.e. labels such as F\u20d71, F\u20d72, F\u20d73 and \u03b8 in Figure 1. (b) Then, we assemble the various low-level diagram\nelements (such as lines and arcs) into higher-level diagram elements (such as axes, blocks, wedges,\npulleys, etc.) by a set of human engineered grouping rules. (c) Then, we map the various text\nelements to corresponding diagram elements detected in the previous stages. For example, the text\nelement F\u20d71 in Figure 1 refers to the leftward arrow in the diagram. (d) In the final step, we use a set\nof human-engineered rules to map the diagram to formal language. We show the various stages in\nthe pipeline and their corresponding inputs and outputs in Figure 2. We will cover the details of the\npipeline later in section 4. For each stage of the pipeline, we have choices for various pre-existing\nsoftware (such as various line or corner detectors), rules or pre-learnt functions that we additionally\nwish to integrate. We also wish to minimize propagation of errors by learning the pipeline jointly.\nNext, we formalize the problem of learning such a pipeline.\n\nFigure 2: A pipeline for diagram parsing with various (possibly multiple) pre-trained functions,\nexisting software and rules. The pre-learnt components are shown in blue, the existing software\nis shown in red and rule-based components are shown in green. Stages shown: Line/Arc Detection, Corner Detection, High-level Elements, Text Element Detection, Object Detection, Label Element Association, Formal Language.\n\n1 For instance, in Figure 1, the visual concept of floor is represented as a solid horizontal line with many small, same sized, parallel lines\nwith their end points on the horizontal line (see Figure 3c). This concept can be concisely expressed as a rule but is hard to learn.\n\n2 Due to space constraints, we only describe the diagram parsing phase in the main paper. The methodology discussed can be\nreadily extended to the other phases. We present the pipeline and results on text parsing and final reconciliation in the supplementary.\n\n
3 Method\nLet x \u2208 X and y \u2208 Y represent members of the input and output domains of a data mining task,\nrespectively, where we wish to learn a prediction function f : X \u2192 Y such that \u0177 \u2248 f(x).\nPipeline. We formally define a pipeline P as a directed acyclic graph (DAG) G = (V, E) where\nnodes V represent various computation modules in the pipeline and edges E represent input/output\nrelationships between the various modules. Given G, we can always derive a topological ordering\nof the computation modules, thus decomposing the prediction problem into S stages. At each stage\ns, a predictor f^(s) takes as input the data instance x and predictions from all previous stages, i.e.\nf^(s) : z^(s) \u2192 y^(s) where z^(s) = (x, \u0177^(0), . . . , \u0177^(s\u22121)). Given a model for each stage of the pipeline,\npredictions are made locally and sequentially with the expressed goal of maximizing performance on\nall the various stages, \u0177 = f(x) = {f^(s)(z^(s))}_{s=1}^{S}.\nExtended Pipeline. Usually, a pipeline has a single predictor at each stage. However, system\nengineers are often faced with many choices for every stage of the pipeline. For example, they might\nhave to choose between many different object detectors or many different part-of-speech taggers. It\nwould be useful not to have to make that choice but to have an ensemble of these choices. Hence, we\nextend our definition of the pipeline and assume that we are given multiple function approximators\n{f_i^(s)}_{i=1}^{K_s} for the pipeline stage s and we wish to use them to estimate the true underlying function\nf^(s). Each f_i^(s) could use pre-existing software, encode a domain-specific rule or be a pre-learnt function.\nProblem Definition: Given the pipeline P, multiple function approximators for each stage\n{{f_i^(s)}_{i=1}^{K_s}}_{s=1}^{S} and partial supervision S = {(x_n, {y_n^(t)}_{t\u2208\u2126_n})}_{n=1}^{N}, we want to learn the global\nprediction function f(x). Here, N denotes the total number of data instances and \u2126_n is the set of\nstages for which supervision is available for the nth data instance x_n. In general, at each stage,\nthe predictor f^(s) may output a binary prediction y^(s) \u2208 {0, 1} or a regression y^(s) \u2208 (0, 1). We\ndesire a framework which can handle partial supervision, existing software and domain knowledge\nexpressed as rules in a feasible manner. To this end, we describe Nuts&Bolts, a probabilistic logic\nframework that integrates these inputs while minimizing a global objective defined over all stages.\n
3.1 Nuts&Bolts\nProbabilistic Logic: Probabilistic logic infers the most likely values for a set of unobserved predicates,\ngiven a set of observed ground predicate values and logic rules. Unlike classical logic, the\ntruth values of ground predicates and rules are continuous and lie in the interval [0, 1], representing\nthe probability that the corresponding ground predicate or rule is True. Boolean logic operators,\nsuch as AND (\u2227), OR (\u2228), NOT (\u00ac) and IMPLIES (\u2192) are redefined using \u0141ukasiewicz\nlogic [23]: A \u2227 B = max{A + B \u2212 1, 0}, A \u2228 B = min{A + B, 1}, \u00acA = 1 \u2212 A, and\nA \u2192 B = min{1 \u2212 A + B, 1}. Next, we define a probabilistic logic program in the most general case\nfor learning a pipeline that integrates different kinds of function approximators, domain knowledge\nand partial supervision. The probabilistic logic program comprises a model for integrating (a)\nmultiple function approximators and (b) domain knowledge.\nA. Integrating multiple function approximators: Based on [30], we introduce a \u2018trustworthiness\u2019\nmodel to integrate multiple function approximators. Let T_i^(s) \u2208 (0, 1) denote how much we can\ntrust the function approximator f_i^(s). In probabilistic logic, we introduce the following rules\nwhich specify the relationship between various function approximators, the unknown true underlying\nfunction and our trusts on the various function approximators. Thus, we have:\n\nf_i^(s)(z^(s)) \u2227 T_i^(s) \u2192 f^(s)(z^(s)),   \u00acf_i^(s)(z^(s)) \u2227 T_i^(s) \u2192 \u00acf^(s)(z^(s))   (1)\nf_i^(s)(z^(s)) \u2227 \u00acT_i^(s) \u2192 \u00acf^(s)(z^(s)),   \u00acf_i^(s)(z^(s)) \u2227 \u00acT_i^(s) \u2192 f^(s)(z^(s))   (2)\n\n
Intuitively, the first set of rules states that if a function approximator is trustworthy, its output should\nmatch the output of the true function. The second set of rules states that if a function approximator\nis not trustworthy, its output should not match the output of the true function. The trust values are\nimplicitly learnt based on the agreement between various function approximators [30].\nWe make an additional assumption that most of the function approximators are better than\nchance. With this assumption, we additionally add the following two rules: f_i^(s)(z^(s)) \u2192\nf^(s)(z^(s)), \u00acf_i^(s)(z^(s)) \u2192 \u00acf^(s)(z^(s)). This helps alleviate the identifiability issues introduced\nby the above rules (eq. 1 and 2). Note that flipping the values of trusts (i.e. setting them to one\nminus the trust values) and the true functions leaves the set of rules in eq. 1 and 2 unchanged.\nIn probabilistic frameworks where all the rules are weighted with a real value in [0, 1], we can\nthink of the weight of these prior belief rules as regularization weights which can be learnt from data.\nNote that we estimate a single trust variable for every function approximator in the pipeline stage\n\u2013 trust is shared across data instances. Thus, the trust variables implicitly couple various function\napproximators by relating them to the true underlying function, aiding semi-supervised learning.\nB. 
Integrating domain knowledge: Pre-existing software or pre-learnt functions can be incorporated\nas function approximators in our probabilistic logic framework. Next, we will describe how we\nincorporate domain knowledge in the form of rules. We assume that the rules are provided to\nus as conditional statements (or implications) which can be read as \u201cif Precondition then\nPostcondition\u201d. Note that Precondition and Postcondition can be arbitrary logical\nformulas (i.e. conjunctions of possibly negated predicates). In our case, the rules relate the input at a\nstage z^(s) to the output y^(s). To incorporate these rules, we introduce a function approximator for the\nstage f_j^(s) and a rule Precondition(z^(s)) \u2192 f^(s)(z^(s)). Introducing rules as function approximators\nallows us to combine domain knowledge expressed as rules with arbitrary function approximators\nusing the formulation described in (A).\nC. Inference and Learning: We use a variant of probabilistic logic, PSL [3], in our work. PSL uses\nsoft logic as its logical component and Markov networks as its statistical model. Soft truth values of\nground predicates form variables of the PSL model and the model learns weights of various rules in\nthe logic program. Let X be the set of variables with known values each in the domain [0, 1] and Y\nbe the set of variables with unknown values in [0, 1]. Let \u03c6 = (\u03c6_1, . . . , \u03c6_k) be the set of k potential\nfunctions to be defined later. Given free parameters \u03bb_1, . . . , \u03bb_k (which correspond to the weights of\nvarious rules), we define the probability density over the set of unknown variables Y as:\n\nf(Y) = (1/Z) exp(\u2212\u2211_{i=1}^{k} \u03bb_i \u03c6_i(X, Y))\n\nZ denotes the normalization constant to ensure that f is a proper probability density function.\nInference, i.e. finding the most probable values of the unknown variables Y, in PSL is performed\nby solving the convex optimization problem min_Y \u2211_{i=1}^{k} \u03bb_i \u03c6_i(X, Y), solved using consensus\nADMM. For learning, PSL proposes a number of approximate learning approaches such as structured\nperceptron, maximum-pseudolikelihood estimation and large-margin estimation. Our approach is\nagnostic to the choice of learning approach. In our experiments, we use maximum-pseudolikelihood\nestimation which maximizes the likelihood of each variable conditioned on all other variables. We\nrefer the reader to [3] for more details. In PSL, potential functions \u03c6_i are typically chosen to be of\nthe form \u03c6_i = (max{0, l_i(X, Y)})^{p_i} for p_i \u2208 {1, 2}, where l_i is some linear function corresponding\nto a measure of the distance to satisfiability of a logic rule. According to the theory of PSL, every logic\nrule can be written in the form A_1 \u2227 A_2 \u2227 \u00b7\u00b7\u00b7 \u2227 A_u \u2192 B_1 \u2228 B_2 \u2228 \u00b7\u00b7\u00b7 \u2228 B_v [4]. In this case, it is easy\nto see that the distance to satisfiability l_i can be written as max{0, \u2211_{j=1}^{u} A_j \u2212 \u2211_{j=1}^{v} B_j + 1 \u2212 u}\nand minimizing the distance to satisfiability amounts to making the rule more satisfied. This joint\nmodeling of the entire pipeline avoids error propagation often incurred when we sequentially use\nlocal models. The joint modeling of known and unknown variables with variable coupling through\nthe various constraints described earlier aids semi-supervised learning as shown in our experiments.\n
4 The Diagram Parsing Pipeline\nNext, we describe the various components of the diagram parsing pipeline, pointing out the pre-learnt\nfunctions, software and rules in each stage of the pipeline (see Table 1). Note that every pre-learnt\nfunction can be treated as a software. We present this difference merely due to philosophical reasons.\n\nTable 1: Various components of the diagram parsing pipeline. We denote the pre-learnt functions by \u2022, software\nby \u2022 and rules by \u2022 in each stage of the pipeline.\n\nLine:\nApply a weak Gaussian blur on the raw image and then binarize it using a threshold selection method proposed\nin [26]. Then, using the result, apply:\n\u2022 Boundary detection and grouping method [19]\n\u2022 Hough transforms [9]\n\u2022 Detect parallel curved edge segments in a Canny edge map.\n\u2022 Recursively merge proposals that exhibit a low residual when fit to a 1st or a 2nd degree polynomial.\n\u2022 A 2 class CNN resembling VGG-16 [43] with a fourth channel (which specifies the location of the diagram element\nsmoothed with a Gaussian kernel of width 5) appended to the standard 3 channel RGB input.\n\nCorner:\n\u2022 Harris corner detector [14]\n\nHigh Level:\n{IF:THEN} expressions, i.e. IF \u2018condition\u2019 THEN \u2018result\u2019 rules as described below:\n\u2022 Arrow: The central line (stem line) is the longest of the three lines, the two arrowhead lines are roughly of the\nsame length, and the two angles subtended by the arrowhead lines with the arrow stem line must be roughly equal\n\u2022 Dotted line: The various lines should be in a straight line, roughly the same sized lines and equi-spaced\n\u2022 Ground: The solid line is in contact with a number of smaller parallel lines which subtend roughly the same angle\nwith it, their end-point lies on the solid line and the smaller lines are on the same side with respect to the solid line\n\u2022 Coordinate System: Three arrows where the arrow tails are incident on the same point. Two lines are mutually\nperpendicular (i.e. angle=90\u00b0) and the third roughly bisects the complementary (270\u00b0) angle\n\u2022 Block: Four lines which form a rectangle\n\u2022 Wedge: Three lines where each two share a distinct end-point\n\u2022 Pulley: A circle with two lines tangent to it. An end-point of the two lines lies on the circle\n\n
Text:\n\u2022 An off-the-shelf OCR system \u2013 Tesseract3.\n\u2022 Since many textual elements are heavily structured (these include elements in vector notation (e.g. F\u20d7), greek\nalphabets (e.g. \u03b8), physical quantities (e.g. 2 m/s)) and are usually longer than a single character, we trained a text\nlocalizer using a CNN having the same architecture as AlexNet [20]. We used the Chars74K dataset [7], a dataset\nobtained from vector PDFs of a number of physics textbooks and a set of synthetic renderings of structured textual\nelements generated by us as training data.\n\nObject:\n\u2022 Window classification [50]\n\u2022 Objectness [1]\n\u2022 A classifier with features capturing location, size, central and Hu moments, etc.\n\u2022 A discriminatively trained part-based model [10] trained to focus on the detection of a manually selected list of\nobjects commonly seen in physics diagrams (blocks, pulleys, etc.).\n\u2022 Global and local search [33]\n\u2022 Cascaded ranking svm [54]\n\u2022 Perceptual grouping [13, 5]\n\u2022 Selective search [49]\n\u2022 Edge boxes [56]\n\nLabel Association:\n\u2022 Type Matching Rules: Type matching rules note that if the element is of type t1 and the text label is of type t2,\nthen the element should be matched to the text label. Thus, the rule can be written down as type(e, t1) \u2227\ntype(t, t2) \u2192 Mto. We have type matching experts for the following element-object types: (a) element is an\narrow and the text label is one of F., v., a., g, x, d indicating physical vector quantities such as forces, velocity,\nacceleration and displacement, (b) element is the coordinate system and the text label is one of x, y or z indicating\none of coordinate system axes, (c) element is a block or a wedge and the text labels it as a \u2018block\u2019 or \u2018wedge\u2019 (or\none of their synonyms) respectively.\n\u2022 Proximity Rules: The proximity rule notes that if the element and the text label are close to each other (i.e. the\nclosest pixels of the element and the text label are closer than a threshold) then the element should be matched\nto the text label, i.e. proximal(t, o) \u2192 Mto\n\u2022 Orientation Rule: The orientation rule notes that if the element and the text label are in the same orientation,\nthey should be matched, i.e. orientation_match(t, o) \u2192 Mto. The orientations are computed using the first\nprincipal component of the grey scale pixels labeled as the element/text.\n\nFormal Language:\n\u2022 Rules (one for each predicate) decide if the predicate holds for a set of diagram elements which are type consistent\nwith the arguments of the predicate.\n\n
Figure 3: Some example high-level diagram elements: (a) Arrow, (b) Dotted line, (c) Ground, (d) Coordinate\nSystem, (e) Block, (f) Wedge, and (g) Pulley. We describe rules to form these elements in Table 1.\n\nIn the first stage, we detect low-level diagram elements (lines and arcs) using a number of pre-learnt\nfunctions and software. For corner detection, we use Harris corner detectors [14]. Then we further\nassemble these low-level diagram elements to high level elements. High level elements can be easily\nexpressed by humans as rules given their knowledge of Physics (see Figure 3). However, it is difficult\nto learn the input-output mapping for high-level elements directly as this will require a very large\namount of labelled data for each high-level element. 
We introduce a set of manually curated grouping\nrules for grouping low-level diagram elements to form high-level diagram elements. For example,\nthe rule to form an arrow tests if there are three detected lines which share an end-point which can\nbe combined to form an arrow. The three lines must also satisfy some additional conditions for the\nhigh-level element to be an arrow. The central line (stem line) is the longest of the three lines, the\ntwo arrowhead lines are roughly of the same length and the two angles subtended by the arrowhead\nlines with the arrow stem line must be roughly equal. This rule is incorporated as shown below:\n\nC1 = isLine(line1) \u2227 isLine(line2) \u2227 isLine(line3)\nC2 = length(line1) > length(line2) \u2227 length(line1) > length(line3)   (i.e. line1 is stem)\nC3 = roughly_equal(angle(line1, line2), angle(line1, line3))\nC1 \u2227 C2 \u2227 C3 \u2192 Hline\n\nAll our rules take the form of {IF:THEN} expressions, e.g. {IF condition THEN result},\nwhere the condition tests if a set of detected low-level elements satisfy the requirements to\nform the high-level element. In general, we write down a rule for high-level element detection\nas r_i : AND(l_i1, l_i2, . . . , l_i\u03b1, c_i1, c_i2, . . . , c_i\u03b2) \u2192 h_i s.t. the rule preconditions P_i1, P_i2, . . . , P_i\u03b3\nare all satisfied. Here, l_i1, l_i2, . . . , l_i\u03b1 denote pre-requisite low-level elements and c_i1, c_i2, . . . , c_i\u03b2\ndenote pre-requisite corner elements required for the application of rule r_i leading to the formation\nof high-level element h_i. Then, we map textual element labels with diagram elements. Let M_te\nrepresent a variable that takes values 1 if the detected element (high-level element or object) e is\nmatched with the detected text label t, and 0 otherwise. Here, we have a matching constraint that\n\u2211_e M_te = 1 which states that every text label must be matched to exactly one high-level element\nor object. 
Next, we build a set of candidate matching rules. These rules essentially capture features\naccounting for type, shape, orientation, distance, etc. In the \ufb01nal stage, we again use a set of rules to\nmap diagrams to formal language. These rules, one for each predicate, decide if the predicate holds\nfor a set of diagram elements which are type consistent with the arguments of the predicate. We have\nrules for listing objects, relative position of objects, forces acting on objects, force directions, etc.\n5 Experiments\nImplementation: In our implementation, we over-generate candidate low-level elements, corners,\nobjects, text elements and plausible high-level elements from various candidate function approxima-\ntors. Then, we create a set of binary variables which take value 1 if the element/corner/object/text\nelement is correct and 0 otherwise. Then, with these variables and domain knowledge expressed as\nrules, we use Nuts&Bolts for learning the pipeline.\nDataset: We validated our system on a dataset of physics questions taken from three popular\npre-university physics textbooks: Resnick Halliday Walker, D. B. Singh and NCERT.\nMillions of students in India study physics from these books every year and these books are available\nonline. We manually identi\ufb01ed chapters relevant for Newtonian physics in these textbooks and took\nall the exercise questions provided in these chapters. This resulted in a dataset of 4941 questions,\nout of which 1019 had associated diagrams. We partitioned the dataset into a random split of 4441\ntraining (912 diagrams) and 500 test (107 diagrams) questions. We annotated ground truth logical\nforms for a part of the training set (1000 questions containing 207 diagrams) and the entire test\nset. The annotated train set questions were used along with the unannotated train set questions for\ntraining our method \u2013 thus, our method is semi-supervised. We report our results on the test set. 
We additionally evaluated our system on the task of answering these problems on two datasets: section 1\nof three AP Physics C Mechanics tests4 \u2013 a practice test and official tests for the years 1998 and 2012.\nExperimental Design: We design our experimental study to evaluate the following claims:\nC1: Nuts&Bolts outperforms prior work on the task of diagram parsing. This also leads to\nimprovements on downstream tasks \u2013 in our case, answering Newtonian physics questions.\nC2: Nuts&Bolts utilizes labelled data as well as unlabelled data to achieve better performance.\nC3: Nuts&Bolts can incorporate supervision at various stages of the pipeline. It is robust to low\namounts of supervision at certain stages in the pipeline.\nC4: Nuts&Bolts jointly models the various stages of a pipeline which prevents error propagation.\nC1: Since there is no existing prior work which performs end-to-end parsing of diagrams to formal\nlanguage, we created three baselines. The first baseline, EB, proposes diagram elements using\nEdgeBoxes [56] and then uses rules defined in Table 1 for label associations and generation of\nformal language. EdgeBoxes was chosen as it relies less on colors and gradients observed in\nnatural images, and because it is the only computer vision approach that performs well on all element\ndetection stages in our experiments. The second baseline, G-ALIGNER, proposes diagram elements\nusing G-ALIGNER [40] and then uses rules. G-ALIGNER works by maximizing the agreement\nbetween textual and visual data via a submodular objective. Similarly, the third baseline, DSDP-Net,\nuses DSDP-Net [17] followed by rules. DSDP-Net proposes diagram elements using an LSTM\nbased scoring mechanism.\nWe compare the three baselines with Nuts&Bolts in terms of the performance on predicting each\ntype of diagram element as well as overall results in terms of the final diagram parse. 
We use two\nmetrics: (a) Jaccard similarity [21] and (b) F1 score (comparing with gold annotation). Table 2\nreports the results.\n\n4 The other sections of the tests are subjective which we leave as future work. Details about the exam questions are available at: https://apstudent.collegeboard.org/apcourse/ap-physics-c-mechanics\n\nTable 2: Comparison (F1 scores and Jaccard similarity with gold annotation) for individual stages of the pipeline\nas well as the final parse. Certain baselines cannot be used to model some stages of the pipeline (denoted as -).\nWe use the corresponding Nuts&Bolts model for those stages to compute the final parse. The performance of\nvarious component function approximators is available in the supplementary.\n\nStage | F1: EB | F1: G-ALIGNER | F1: DSDP-Net | F1: Nuts&Bolts | Jaccard: EB | Jaccard: G-ALIGNER | Jaccard: DSDP-Net | Jaccard: Nuts&Bolts\nLow-level | - | 0.78 | - | 0.94 | - | 0.80 | - | 0.87\nCorner | 0.59 | 0.95 | 0.76 | 0.95 | 0.57 | 0.91 | 0.83 | 0.91\nHigh-level | 0.45 | 0.71 | 0.50 | 0.90 | 0.42 | 0.74 | 0.52 | 0.82\nText | 0.57 | 0.77 | 0.66 | 0.90 | 0.54 | 0.78 | 0.71 | 0.85\nObject | 0.33 | 0.30 | 0.52 | 0.82 | 0.29 | 0.31 | 0.47 | 0.64\nLabel Associations | - | 0.80 | - | 0.83 | - | 0.86 | - | 0.88\nParsing Performance | 0.42 | 0.65 | 0.58 | 0.74 | 0.44 | 0.68 | 0.56 | 0.78\n\nTable 3: Question Answering accuracy of Nuts&Bolts (N&B) compared to the various baselines in the\nParsing to Programs (P2P) framework on four datasets: problems from physics textbooks (T), AP\nPhysics C Mechanics \u2013 Section 1 practice test (P) and official tests for 1998 (\u201998) and 2012 (\u201912).\n\nSystem | T | P | \u201998 | \u201912\nEB | 20 | 16 | 16 | 18\nG-ALIGNER | 23 | 19 | 16 | 21\nDSDP-Net | 21 | 16 | 15 | 18\nN&B | 32 | 24 | 24 | 26\n\nNuts&Bolts achieves a much superior performance to all three baselines on\nboth metrics. Furthermore, we get improvements on all the stages of the pipeline. 
Prior computer\nvision techniques are tuned for natural images and hence do not port well to diagrams, which require\ndomain knowledge. However, our carefully engineered pipeline with ensembles of element detectors\nand explicit domain knowledge in the form of rules can work well even in this challenging domain.\nThe results for problem text parsing are available in the supplementary.\nWe additionally used Nuts&Bolts in Parsing to\nPrograms [37], an existing framework proposed for \u201csituated\u201d question answering. The system takes in the formal\nrepresentation of the problem and uses it to solve the problems\nusing an expert system with axioms and laws of Physics written\ndown as executable programs. We compared systems which use\nthe Nuts&Bolts output against systems that use the output\nfrom various diagram parsing baselines as the formal diagram\nparse (all systems used Nuts&Bolts for text parsing). Table\n3 shows the score (percentage of questions correctly answered)\nachieved by Nuts&Bolts and the various baselines on the\nfour question answering datasets described before. We observe\nthat Nuts&Bolts achieves a better performance than all the\nbaselines on the challenging task of answering Newtonian physics\nproblems. We show examples of correctly and incorrectly\nanswered questions in the supplementary.\nC2: A key benefit of Nuts&Bolts is its ability to incorporate\nunlabelled data. We investigate how changing the\namount of unlabelled data in addition to the labelled data\nchanges the performance of our diagram parser. Figure 4\nplots the F1 performance of the final diagram parse as well\nas various stages as we vary the amount of unlabelled diagrams\nwhile keeping the labelled diagram set fixed. We can\nobserve from the plot that adding unlabelled data substantially\nimproves performance (from 0.58 when no unlabelled\ndata is used to 0.74 when all unlabelled data is used). 
Such\nimprovements are also observed for various stages of the\npipeline to varying degrees.\nC3: Nuts&Bolts allows for varying amounts of supervision at each stage in the pipeline. Figure\n5 shows the performance of individual pipeline stages as well as overall performance when we vary\nsupervision for one stage in the pipeline (keeping supervision for all other stages the same). We can\nobserve that our model remains robust when supervision at an individual stage is reduced\n(by reducing the number of diagrams whose supervision is provided to the stage). For example, the\nperformance drops only from 0.74 to 0.61 even when supervision to the text-element detection\nstage is reduced from 207 diagrams (the entire labelled set) to 25 diagrams.\nC4: Nuts&Bolts learns the entire pipeline jointly, thus preventing error propagation. To test this,\nwe perform an ablation study using a traditional pipeline which makes sequential predictions in a\nstagewise manner. We consider variants of the traditional pipeline which aggregate the function\napproximators via various combination approaches: best predictor, majority vote,\nweighted average and AdaBoost.\n\nFigure 4: F1 score for diagram parsing\nand various pipeline stages with varying\namount of unlabelled diagrams. [Plot: x-axis Unlabeled Diagrams, y-axis F1 Score; curves: Low-level, Corner, High-level, Text, Object, Element-Label, Overall.]\n\nTable 4: F1 scores for identifying diagram elements, label associations and the final parse.\n(Traditional Pipeline: Best pred., Wtd. Avg., Maj. Vote, AdaBoost. Nuts&Bolts: No Learn, Trust=1, Full.)\n\nStage | Best pred. | Wtd. Avg. | Maj. Vote | AdaBoost | No Learn | Trust=1 | Full\nLow-level | 0.70 | 0.81 | 0.79 | 0.84 | 0.76 | 0.81 | 0.94\nText | 0.68 | 0.79 | 0.77 | 0.80 | 0.80 | 0.84 | 0.90\nObject | 0.63 | 0.68 | 0.71 | 0.71 | 0.73 | 0.76 | 0.82\nLabel Associations | 0.67 | 0.74 | 0.73 | 0.72 | 0.73 | 0.76 | 0.83\nParsing Performance | 0.55 | 0.65 | 0.59 | 0.66 | 0.59 | 0.64 | 0.74\n\nWe report the results in Table 4. 
We observe that\nNuts&Bolts performs significantly better than each of the traditional pipelines, validating our case.\nThe case can be made stronger by observing Figure 5. The dotted green curve represents the\nperformance of models that learn the combination of various function approximators separately\nfor that stage. The solid green curve shows results for Nuts&Bolts. The difference between the\nslopes of the solid green curve and the dotted green curve shows that independently trained function\ncombinators suffer much more than jointly trained function combinators with decreasing amounts of\navailable training data. Similarly, the difference between the slopes of the dotted blue curve and the solid blue\ncurve shows how overall performance degrades when the function combinator for a particular stage is\nindependently learned, particularly in low data scenarios.\nWe further investigate the importance of modelling trust in our approach. We compared\nNuts&Bolts to a variant of Nuts&Bolts where we do not model trustworthiness (i.e. set\nall trust variables to 1). We observe a significant drop in performance (final parse F1 falls from 0.74 to 0.64), which confirms the necessity of\nlearning trust for various candidate function approximators. Then, we also show results when we do\nnot perform learning, i.e. we simply set the weights of the various rules in the PSL to\n1 (Table 4). We can observe that the performance of our model also drops (to 0.59) if we do not learn the rule\nweights of our Nuts&Bolts approach, showing the importance of learning.\n6 Related Work\nThe issue of data hungriness of end-to-end models is well known [53, 22]. Even though a number\nof regularization [45] and semi-supervised learning [18] techniques have been proposed, in practice\ndata hungriness is still an issue in these models. One possible solution is incorporating domain\nknowledge into these models. 
This is also difficult and heuristic solutions such as fusing predictions\nfrom external models, feature concatenation or averaging output are popular [27]. Two promising\nbut under-explored lines of work here are [16] and [25], which propose a teacher-student network to\nharness logical rules and distant supervision, respectively.\n\nFigure 5: These plots show how the performance (F1 score for diagram parsing) varies when we vary the\namount of supervision in our models for one stage in the pipeline, keeping the supervision at all other stages the\nsame. The chart label notes the pipeline stage whose supervision is varied. For example, in the first plot we vary\nthe amount of low-level element supervision provided to the system. The plots in solid green (SP) show the\nperformance at the stage level, the solid blue plots (OP) show the overall performance of the system (in terms\nof diagram parse literals). The dotted green curve (ISP) represents the performance of models that learn the\ncombination of various function approximators independently for that stage. The dotted blue curve (ISOP) is\nobtained by incorporating the independently learned aggregator into Nuts&Bolts. [Five panels: Low-level Elements, High-level Elements, Text Elements, Object Elements, Element-Label Mapping; x-axis Labeled Data, y-axis F1 Score.]\n\nTable 5: F1 scores for identifying diagram elements, label associations and the final parse using the bi-lattice\nlogic formalism [42], PCFG based hierarchical grammars [55] and Nuts&Bolts (N&B). We have different\nvariants for each model: inference only (Inf.), where rule weights are set to 1, supervised (Sup.)\nand semi-supervised learning (Semi-sup.).\n\nStage | Bi-lattice Inf. | Bi-lattice Sup. | PCFG Inf. | PCFG Sup. | PCFG Semi-sup. | N&B Inf. | N&B Sup. | N&B Semi-sup.\nLow-level | 0.65 | 0.70 | 0.62 | 0.66 | 0.77 | 0.72 | 0.75 | 0.94\nCorner | 0.81 | 0.85 | 0.76 | 0.79 | 0.90 | 0.86 | 0.89 | 0.95\nHigh-level | 0.57 | 0.63 | 0.54 | 0.59 | 0.66 | 0.62 | 0.65 | 0.90\nText | 0.57 | 0.59 | 0.53 | 0.56 | 0.78 | 0.60 | 0.61 | 0.90\nObject | 0.50 | 0.53 | 0.48 | 0.51 | 0.68 | 0.53 | 0.57 | 0.82\nLabel Associations | 0.48 | 0.54 | 0.49 | 0.55 | 0.70 | 0.52 | 0.60 | 0.83\nParsing Performance | 0.49 | 0.53 | 0.47 | 0.51 | 0.64 | 0.52 | 0.58 | 0.74\n\nWe mitigate the data hungriness issue by learning a pipeline \u2013 thereby making the learning process\nmodular [2] and integrating domain knowledge \u2013 yet without incurring error propagation. Our\nproposal is related to previous works which propose linear programming [24, 32] and graphical\nmodel [31, 15] formulations for post-hoc inference in a cascade of classifiers. A key difference between\nthis line of work and our work is that we allow multiple function approximators in each stage in the\npipeline and integration of domain knowledge in the form of rules. The former has been explored\nseparately in various ensemble approaches [8], in error estimation from unlabeled data [28, 29, 30]\nand crowd-sourcing applications [46]. Our work is also related to parallel pipelines for feature\nextraction, bi-lattice logic formalism, hierarchical grammars, performance modeling and Bayesian\nfusion methods. These formalisms can be used in our problem setting. However, these usually do not\nincorporate: (a) multiple function approximators for each sub-task which is necessary when we have\nmany weak learners but no one best model to do the sub-task, (b) existing software pieces as sub-task\nfunctions, (c) stage-wise supervision, and (d) semi-supervised learning. All these are necessary in low\ndata scenarios when end-to-end models are infeasible. 
For the sake of completeness, we implemented\nthe bi-lattice logic formalism as in [42] and a PCFG based hierarchical grammar formalism proposed\nin [55] where we used the best function approximator at each stage of the pipeline. PCFGs are\npopular in NLP and can be trained both in a supervised (via MLE) as well as semi-supervised manner\n(via EM). Table 5 reports F1 scores of various stages as well as the overall parsing and shows how our\nsupervised variant outperforms the supervised learners in the bi-lattice logic formalism and PCFG.\nIn addition, when we incorporate unlabeled data, Nuts&Bolts achieves a large boost (from 0.58 to 0.74 F1 on the final parse) \u2013 a significant\nimprovement over the supervised competitors and semi-supervised PCFG. Finally, our method is\nbased on PSL [3] which allows for easy integration of first order logic rules. PSLs are a generalization\nof MLNs [36] which can also be used. Our approach can also be extended to other constraint driven\nmethods such as CCMs [6] which incorporate domain knowledge in the form of constraints.\nFrom an application perspective, while the domain of natural images has received a lot of interest,\nthe domain of diagram analysis is not very well studied. In particular, [12] analyzed graphs and\nfinite automata sketches, [17] studied food-web diagrams and [40] studied geometry diagrams. As\nshown in our experiments, these techniques do not perform as well as Nuts&Bolts. This is because\nwe leverage domain knowledge and structure of diagrams by building a bottom-up diagram parser.\nThe idea of bottom-up analysis has been sparsely explored in images [48, 55, 44], however, without\ndomain knowledge and use of existing software in the pipeline process.\n7 Relation Extraction\n\nWe additionally performed experiments on relation extraction (a key NLP task), comparing our\napproach to Snorkel, the recently proposed state-of-the-art approach for the task (it beat the previous\nbest LSTM model and won the 2014 TAC-KBP challenge). 
We followed the same experimental\nprotocol as in [34, 35] and compared our method to Snorkel (discriminative model \u2013 the best Snorkel\nmodel) on four datasets provided with the Snorkel release. Table 6 shows the predictive performance\n(in terms of Precision, Recall and F1 scores) on the relation extraction task. We use the same labeling\nrules as described in the Snorkel papers. Nuts&Bolts achieves a higher F1 score than Snorkel on\nall four datasets, setting a new state of the art. This makes a case that our framework is\nindeed general and widely applicable.\n\nTable 6: Precision/Recall/F1 scores comparing Nuts&Bolts to Snorkel on four relation extraction datasets.\n\nDataset | System | P | R | F1\nCDR | Snorkel | 38.8 | 54.3 | 45.3\nCDR | N&B | 41.5 | 54.5 | 47.1\nSpouses | Snorkel | 48.4 | 61.6 | 54.2\nSpouses | N&B | 49.3 | 61.9 | 54.9\nKBP (News) | Snorkel | 50.5 | 29.2 | 37.0\nKBP (News) | N&B | 51.2 | 30.3 | 38.1\nGenomics | Snorkel | 83.9 | 43.4 | 57.2\nGenomics | N&B | 84.5 | 43.3 | 57.3\n\n8 Conclusion\nWe proposed Nuts&Bolts, a framework to learn pipelines with a provided hierarchy of sub-tasks.\nOur framework incorporates multiple function approximators for various sub-tasks, domain\nknowledge in the form of rules and stage-wise supervision. Nuts&Bolts is a philosophy of learning\nwith modularization and can be beneficial in limited data domains when we can learn from labelled\nas well as unlabelled data and when end-to-end training becomes infeasible.\n\nAcknowledgements\n\nThis work is supported by the ONR grant N000141712463 and the NIH grant R01GM114311. Any\nopinions, findings and conclusions or recommendations expressed in this material are those of the\nauthor(s) and do not necessarily reflect the views of ONR or NIH.\n\nReferences\n[1] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image\nwindows. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189\u20132202,\n2012.\n\n[2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n39\u201348, 2016.\n\n[3] Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss markov random\n\ufb01elds and probabilistic soft logic. Journal of Machine Learning Research, 18(109):1\u201367, 2017.\n[4] Matthias Br\u00f6cheler, Lilyana Mihalkova, and Lise Getoor. Probabilistic similarity logic. In\nProceedings of the Twenty-Sixth Conference on Uncertainty in Arti\ufb01cial Intelligence, UAI\u201910,\npages 73\u201382, 2010.\n\n[5] Jo\u00e3o Carreira, Fuxin Li, and Cristian Sminchisescu. Object recognition by sequential \ufb01gure-\n\nground ranking. International journal of computer vision, 98(3):243\u2013262, 2012.\n\n[6] Ming-Wei Chang, Lev Ratinov, and Dan Roth. Structured learning with constrained conditional\n\nmodels. Machine learning, 88(3):399\u2013431, 2012.\n\n[7] T. E. de Campos, B. R. Babu, and Manik Varma. Character recognition in natural images. In\nProceedings of the International Conference on Computer Vision Theory and Applications,\nLisbon, Portugal, February 2009.\n\n[8] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on\n\nmultiple classi\ufb01er systems, pages 1\u201315. Springer, 2000.\n\n[9] Richard O Duda and Peter E Hart. Use of the hough transformation to detect lines and curves in\n\npictures. Communications of the ACM, 15(1):11\u201315, 1972.\n\n[10] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object\ndetection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627\u20131645,\n2010.\n\n[11] Jenny Rose Finkel, Christopher D Manning, and Andrew Y Ng. 
Solving the problem of\ncascading errors: Approximate bayesian inference for linguistic annotation pipelines.\nIn\nProceedings of the 2006 Conference on Empirical Methods in Natural Language Processing,\npages 618\u2013626. Association for Computational Linguistics, 2006.\n\n[12] Robert P Futrelle, Mingyan Shao, Chris Cieslik, and Andrea Elaina Grimes. Extraction, layout\nanalysis and classi\ufb01cation of diagrams in pdf documents. In Document Analysis and Recognition,\n2003. Proceedings. Seventh International Conference on, pages 1007\u20131013. IEEE, 2003.\n\n[13] Chunhui Gu, Joseph J Lim, Pablo Arbel\u00e1ez, and Jitendra Malik. Recognition using regions. In\n\nComputer Vision and Pattern Recognition (CVPR), 2009, pages 1030\u20131037. IEEE, 2009.\n\n[14] C. Harris and M. Stephens. A combined corner and edge detection. In Proceedings of The\n\nFourth Alvey Vision Conference, pages 147\u2013151, 1988.\n\n[15] Kristy Hollingshead and Brian Roark. Pipeline iteration. In Proceedings of the 45th Annual\n\nMeeting of the Association of Computational Linguistics, pages 952\u2013959, 2007.\n\n[16] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural\n\nnetworks with logic rules. arXiv preprint arXiv:1603.06318, 2016.\n\n[17] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Min Joon Seo, Hannaneh Hajishirzi, and Ali\nFarhadi. A diagram is worth a dozen images. In Computer Vision - ECCV 2016 - 14th European\nConference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pages\n235\u2013251, 2016.\n\n[18] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-\nsupervised learning with deep generative models. In Advances in Neural Information Processing\nSystems, pages 3581\u20133589, 2014.\n\n[19] Iasonas Kokkinos. 
Highly accurate boundary detection and grouping. In Conference on\nComputer Vision and Pattern Recognition (CVPR), pages 2520\u20132527, 2010.\n\n[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep\nconvolutional neural networks. In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[21] Michael Levandowsky and David Winter. Distance between sets. Nature, 234(5323):34, 1971.\n\n[22] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.\n\n[23] Jan Lukasiewicz. O logice tr\u00f3jwartociowej. Studia Filozoficzne, 270(5), 1988.\n\n[24] Tomasz Marciniak and Michael Strube. Beyond the pipeline: Discrete optimization in nlp. In\nProceedings of the Ninth Conference on Computational Natural Language Learning, pages\n136\u2013143. Association for Computational Linguistics, 2005.\n\n[25] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction\nwithout labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting\nof the ACL and the 4th International Joint Conference on Natural Language Processing of the\nAFNLP: Volume 2-Volume 2, pages 1003\u20131011. Association for Computational Linguistics,\n2009.\n\n[26] Nobuyuki Otsu. A threshold selection method from gray-level histograms. Automatica, 11\n(285-296):23\u201327, 1975.\n\n[27] Eunbyung Park, Xufeng Han, Tamara L Berg, and Alexander C Berg. Combining multiple\nsources of knowledge in deep cnns for action recognition. In Applications of Computer Vision\n(WACV), 2016 IEEE Winter Conference on, pages 1\u20138. IEEE, 2016.\n\n[28] Emmanouil Antonios Platanios, Avrim Blum, and Tom M. Mitchell. Estimating accuracy\nfrom unlabeled data. 
In Proceedings of the Thirtieth Conference on Uncertainty in Artificial\nIntelligence, UAI 2014, Quebec City, Quebec, Canada, July 23-27, 2014, pages 682\u2013691, 2014.\n\n[29] Emmanouil Antonios Platanios, Avinava Dubey, and Tom M. Mitchell. Estimating accuracy\nfrom unlabeled data: A bayesian approach. In Proceedings of the 33rd International Conference\non Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1416\u20131425,\n2016.\n\n[30] Emmanouil Antonios Platanios, Hoifung Poon, Tom M. Mitchell, and Eric Horvitz. Estimating\naccuracy from unlabeled data: A probabilistic logic approach. CoRR, abs/1705.07086, 2017.\nURL http://arxiv.org/abs/1705.07086.\n\n[31] Vasin Punyakanok, Dan Roth, Wen-tau Yih, and Dav Zimak. Learning and inference over\nconstrained output. In IJCAI-05, Proceedings of the Nineteenth International Joint Conference\non Artificial Intelligence, pages 1124\u20131129, 2005.\n\n[32] Karthik Raman, Adith Swaminathan, Johannes Gehrke, and Thorsten Joachims. Beyond\nmyopic inference in big data pipelines. In Proceedings of the 19th ACM SIGKDD international\nconference on Knowledge discovery and data mining, pages 86\u201394. ACM, 2013.\n\n[33] Pekka Rantalankila, Juho Kannala, and Esa Rahtu. Generating object segmentation proposals\nusing global and local search. In Proceedings of the IEEE conference on computer vision and\npattern recognition, pages 2417\u20132424, 2014.\n\n[34] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. R\u00e9. Data programming: Creating large\ntraining sets, quickly. In NIPS, pages 3567\u20133575, 2016.\n\n[35] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher R\u00e9.\nSnorkel: Rapid training data creation with weak supervision. arXiv preprint arXiv:1711.10160,\n2017.\n\n[36] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):\n107\u2013136, 2006.\n\n[37] Mrinmaya Sachan and Eric Xing. 
Parsing to programs: A framework for situated qa. In\nProceedings of the 24th SIGKDD Conference on Knowledge Discovery and Data Mining\n(KDD), 2018.\n\n[38] Mrinmaya Sachan, Kumar Dubey, and Eric Xing. From textbooks to knowledge: A case study\nin harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings\nof the 2017 Conference on Empirical Methods in Natural Language Processing, pages 773\u2013784,\n2017.\n\n[39] Mrinmaya Sachan, Minjoon Seo, Hannaneh Hajishirzi, and Eric Xing. Parsing to programs: A\nframework for situated question answering. 2018.\n\n[40] Min Joon Seo, Hannaneh Hajishirzi, Ali Farhadi, and Oren Etzioni. Diagram understanding in\ngeometry questions. In Proceedings of AAAI, 2014.\n\n[41] Min Joon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving\ngeometry problems: combining text and diagram interpretation. In Proceedings of EMNLP,\n2015.\n\n[42] V. Shet, M. Singh, C. Bahlmann, V. Ramesh, J. Neumann, and L. Davis. Predicate logic based\nimage grammars for complex pattern recognition. IJCV, 93(2):141\u2013161, 2011.\n\n[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[44] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes and\nnatural language with recursive neural networks. In Proceedings of the 28th international\nconference on machine learning (ICML-11), pages 129\u2013136, 2011.\n\n[45] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.\nDropout: A simple way to prevent neural networks from overfitting. The Journal of Machine\nLearning Research, 15(1):1929\u20131958, 2014.\n\n[46] Tian Tian and Jun Zhu. Max-margin majority voting for learning from crowds. In Advances in\nNeural Information Processing Systems, pages 1621\u20131629, 2015.\n\n[47] Geoffrey G Towell and Jude W Shavlik. 
Knowledge-based artificial neural networks. Artificial\nintelligence, 70(1-2):119\u2013165, 1994.\n\n[48] Zhuowen Tu, Xiangrong Chen, Alan L Yuille, and Song-Chun Zhu. Image parsing: Unifying\nsegmentation, detection, and recognition. International Journal of computer vision, 63(2):\n113\u2013140, 2005.\n\n[49] Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective\nsearch for object recognition. International journal of computer vision, 104(2):154\u2013171, 2013.\n\n[50] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of\ncomputer vision, 57(2):137\u2013154, 2004.\n\n[51] Henning Wachsmuth. Text analysis pipelines: towards ad-hoc large-scale text mining, volume\n9383. Springer, 2015.\n\n[52] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang\nMacherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google\u2019s neural machine\ntranslation system: Bridging the gap between human and machine translation. arXiv preprint\narXiv:1609.08144, 2016.\n\n[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.\n\n[54] Ziming Zhang, Jonathan Warrell, and Philip HS Torr. Proposal generation for object detection\nusing cascaded ranking svms. In Computer Vision and Pattern Recognition (CVPR), 2011,\npages 1497\u20131504. IEEE, 2011.\n\n[55] Song-Chun Zhu, David Mumford, et al. A stochastic grammar of images. Foundations and\nTrends\u00ae in Computer Graphics and Vision, 2(4):259\u2013362, 2007.\n\n[56] C Lawrence Zitnick and Piotr Doll\u00e1r. Edge boxes: Locating object proposals from edges. In\nEuropean Conference on Computer Vision, pages 391\u2013405. Springer, 2014.\n", "award": [], "sourceid": 99, "authors": [{"given_name": "Mrinmaya", "family_name": "Sachan", "institution": "Carnegie Mellon University"}, {"given_name": "Kumar Avinava", "family_name": "Dubey", "institution": "Carnegie Mellon University"}, {"given_name": "Tom", "family_name": "Mitchell", "institution": "Carnegie Mellon University"}, {"given_name": "Dan", "family_name": "Roth", "institution": "UPenn"}, {"given_name": "Eric", "family_name": "Xing", "institution": "Petuum Inc. /  Carnegie Mellon University"}]}