{"title": "Minimizing Uncertainty in Pipelines", "book": "Advances in Neural Information Processing Systems", "page_first": 2942, "page_last": 2950, "abstract": "In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human expert, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable.", "full_text": "Minimizing Uncertainty in Pipelines\u2217\n\nNilesh Dalvi\nFacebook, Inc.\nnileshd@fb.com\n\nAditya Parameswaran\n\nStanford University\n\nadityagp@cs.stanford.edu\n\nVibhor Rastogi\n\nGoogle, Inc.\n\nvibhor.rastogi@gmail.com\n\nAbstract\n\nIn this paper, we consider the problem of debugging large pipelines by human\nlabeling. We represent the execution of a pipeline using a directed acyclic graph\nof AND and OR nodes, where each node represents a data item produced by some\noperator in the pipeline. We assume that each operator assigns a con\ufb01dence to\neach of its output data. We want to reduce the uncertainty in the output by issuing\nqueries to a human, where a query consists of checking if a given data item is\ncorrect. In this paper, we consider the problem of asking the optimal set of queries\nto minimize the resulting output uncertainty. We perform a detailed evaluation of\nthe complexity of the problem for various classes of graphs. 
We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable.\n\n1 Introduction\n\nIn this paper, we consider the problem of debugging pipelines consisting of a set of data processing operators. There is a growing interest in building various web-scale automatic information extraction pipelines [9, 10, 14, 7], with operators such as clustering, extraction, classification, and deduplication. The operators are often based on machine learned models, and they associate confidences with the data items they produce. At the end, we want to resolve the uncertainties of the final output tuples, i.e., figure out which of them are correct and which are incorrect.\nGiven a fixed labeling budget, we can only inspect a subset of the output tuples. However, the output uncertainties are highly correlated, since different tuples share their lineage. Thus, inspecting a tuple also gives us information about the correctness of other tuples. In this paper, we consider the following interesting and non-trivial problem: given a budget of k tuples, choose the k tuples to inspect that minimize the total uncertainty in the output. We will formalize the notion of a data pipeline and uncertainty in Section 2. Here, we illustrate the problem using an example.\nExample 1.1. Consider a simple hypothetical pipeline for extracting computer scientists from the Web that consists of two operators: a classifier that takes a webpage and determines if it is a page about computer science, and a name extractor that extracts names from a given webpage. Fig. 1 shows an execution of this pipeline. There are two webpages, w1 and w2, output by the classifier. The extractor extracts entities e1 and e2 from w1 and e3, e4 and e5 from w2. Each operator also gives a confidence with its output. In Fig. 1, the classifier attaches a probability of 0.9 and 0.8 to pages w1 and w2. 
Similarly, the extractor attaches a probability to each of the extractions e1 to e5. The probability that an operator attaches to a tuple is conditioned on the correctness of its input. Thus, the final probability of e1 is 0.8 \u00d7 0.9 = 0.72. Similarly, the final probabilities of e2 to e5 are 0.45, 0.8, 0.8 and 0.48 respectively. Note that the uncertainties are correlated, e.g., e3 and e4 are either both correct or both incorrect. We want to choose k tuples to inspect that minimize the total output uncertainty.\n\nFigure 1: Pipeline Example (classifier confidences: w1 = 0.9, w2 = 0.8; extractor confidences: e1 = 0.8 and e2 = 0.5 from w1, and e3 = 1, e4 = 1, e5 = 0.6 from w2)\n\n\u2217This work was partly done when the authors were employed at Yahoo! Research.\n\nGraph | BEST-1 | INCR | BEST-K\nTREE(2) | O(n) | O(n), or O(log n) with O(n log n) preprocessing | OPEN (weakly PTIME exact); 2-approximate\u2020: O(n log n)\nTREE | O(n) | O(n) | OPEN (O(n^{k+1}))\nDAG(2,\u2227) or DAG(\u2227) | O(n^3) | PP-hard, hard to approximate | PP-hard, hard to approximate\nDAG(2,\u2228) | O(n^3) | PP-hard, hard to approximate | PP-hard, hard to approximate\nDAG(\u2228) | PP-hard | PP-hard, hard to approximate | PP-hard, hard to approximate\nDAG | PP-hard, hard to approximate | PP-hard, hard to approximate | PP-hard, hard to approximate\nTable 1: Summary of Results (PP = probabilistic polynomial time); \u2020twice the number of queries to achieve the same objective as optimal\n\nIf all the data items were independent, we would have queried the most uncertain items, i.e., those having probability closest to 1/2. However, in the presence of correlations between the output tuples, the problem becomes non-trivial. For instance, let us revisit the first example with k = 1, i.e., we can inspect one tuple. Of the 5 output tuples, e5 is the most uncertain, since its probability 0.48 is closest to 1/2. However, one might argue that e3 (or e4) is a more informative item to query, since the extractor has full confidence on e3. Thus, e3 is correct iff w2 is correct (i.e., the classifier was correct on w2). 
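The final probabilities above can be reproduced mechanically; a minimal sketch (the dictionary encoding of Fig. 1 and the function name are ours, not the paper's):

```python
# Final probabilities for Example 1.1, obtained by multiplying operator
# confidences along each root-to-leaf path of the (unlabeled) tree.
p = {"w1": 0.9, "w2": 0.8,              # classifier confidences
     "e1": 0.8, "e2": 0.5,              # extractor confidences (from w1)
     "e3": 1.0, "e4": 1.0, "e5": 0.6}   # extractor confidences (from w2)
parent = {"e1": "w1", "e2": "w1", "e3": "w2", "e4": "w2", "e5": "w2"}

def final_prob(n):
    """Pr(Y(n) = 1) in a tree: product of p over n and all its ancestors."""
    prob = p[n]
    while n in parent:
        n = parent[n]
        prob *= p[n]
    return prob

print({e: round(final_prob(e), 2) for e in ["e1", "e2", "e3", "e4", "e5"]})
# matches the values in the text: 0.72, 0.45, 0.8, 0.8, 0.48
```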
Resolving e3 completely resolves the uncertainty in w2, which, in turn, completely resolves the uncertainty in e4 and reduces the uncertainty in e5. The argument holds even when the extractor confidence in e3 is less than 1 but still very high. In general, one can also query intermediate nodes in addition to the output tuples, and choosing the best node is non-trivial.\nIn this paper, we consider the general setting of a data pipeline given by a directed acyclic graph that can capture such motivating scenarios. We define a measure of total uncertainty of the final output based on how close the probabilities are to either 0 or 1. We give efficient algorithms to find the set of data items to query that minimizes the total uncertainty of the output, both under interactive and batch settings.\n\n1.1 Related Work\nOur problem is an instance of active learning [27, 13, 12, 17, 2, 15, 5, 4, 3], since our goal is to infer the probability of nodes in the DAG being true by asking for labels of example nodes. The metric that we use is similar to the square loss metric. However, our problem has salient differences. Unlike traditional active learning, where we want to learn the underlying probabilistic model from iid samples, in our problem we already know the underlying model and want to gain information about non-iid items with known correlations. This makes our setting novel and interesting.\nOur DAG structure is a special case of Bayesian networks [6]. A lot is known about general Bayes-net inference [21]. For instance, MAP inference given evidence is NP^PP-complete [24] (approximate inference is NP-complete [1]); inferring whether the probability of a set of variables taking certain values given evidence about others is > 0 is NP-complete [8]; deciding whether it is > t is PP-complete [22]; and computing its value is #P-complete [26]. However, these results do not apply to our problem setting. 
In our setting, we are given a set of non-iid items whose correlations are given by a Bayesian network with known structure and probabilities. We want to choose a subset of items, conditioned on which, the uncertainty of the remaining items is minimized.\nOur work is closely related to the field of active diagnosis [28, 19, 20], where the goal is to infer the state of unknown nodes in a network by selecting suitable \u201ctest probes\u201d. From this field, the most closely related work is that by Krause and Guestrin [19], which considers minimization of uncertainty in a Bayesian network. In that work, the goal is to identify a subset of variables in a graphical model that would minimize the joint uncertainty of a target set of variables. Their primary result is a proof of submodularity under suitable independence assumptions on the graphical model, which is then used to derive an approximation algorithm to pick variables. In our problem setting, submodularity does not hold, and hence the techniques do not apply. On the other hand, since our graphical model has a specific AND/OR structure, we are able to concretely study the complexity of the algorithms. Our work is also related to the work on graph search [23], where the goal is to identify hidden nodes while asking questions to humans. Since the target applications are different, the underlying model in that work is less general.\n\n2 Problem Statement\n\nExecution Graph: Let G be a directed acyclic graph (dag), where each node n in G has a label from the set {\u2227,\u2228} and a probability p(n). We call such a graph a probabilistic and-or dag. We denote the class of such graphs as DAG. We represent the results of an execution of a pipeline of operators using a probabilistic and-or dag.\nThe semantics of G \u2208 DAG is as follows. Each node in G represents a data item. The parents of a node n, i.e. 
the set of nodes having an outgoing edge to n, denote the set of data items which were input to the instance of the operator that produced n. We use parent(n) to denote the parents of n. The probability p(n) denotes the probability that the data item n is correct conditioned on parent(n) being correct. If n has label \u2227, then it requires all the parents to be correct. If n has label \u2228, it requires at least one parent to be correct. We further assume that, conditioned on the parents being correct, nodes are correct independently.\nTo state the semantics formally, we associate an independent Boolean random variable X(n) with each node n in G, where X(n) is true with probability p(n). We also associate another set of random variables Y (n), which denotes whether the result at node n is correct (unconditionally). For a \u2227 node, Y (n) is defined as: Y (n) = X(n) \u2227 \u22c0_{m \u2208 parent(n)} Y (m). For a \u2228 node, Y (n) is defined as: Y (n) = X(n) \u2227 \u22c1_{m \u2208 parent(n)} Y (m).\nWhen G is a tree, i.e., all nodes have a single parent, the labels of nodes do not have any effect, since Y (n) is the same for both \u2227 and \u2228 nodes. In this case, we simply treat G as an unlabeled tree. For instance, Figure 1 denotes the (unlabeled) tree for the pipeline given in Example 1.1. Thus, probabilistic and-or dags provide a powerful formalism to capture data pipelines in practice, such as the one in Example 1.1.\nOutput Uncertainty: Let L denote the set of leaves of G, which represent the final output of the pipeline. We want all the final probabilities of L to be close to either 0 or 1, since the closer the probability is to 1/2, the more uncertain the correctness of the given node is. Let f (p) denote some measure of uncertainty of a random variable as a function of its probability p. 
Then, we define the total output uncertainty of the DAG as\n\nI = \u2211_{n \u2208 L} f (Pr(Y (n)))    (1)\n\nOur results continue to hold when different n \u2208 L are weighted differently, i.e., we use a weighted version of Eq. (1). We describe this simple extension in the extended technical report [11].\nNow, our goal is to query a set of nodes Q that minimizes the expected total output uncertainty conditioned on observing Q. We define this as follows. Let Q = {l1, l2, \u00b7\u00b7\u00b7, lk} be a set of nodes. Given v = {v1, \u00b7\u00b7\u00b7, vk} \u2208 {0,1}^k, we use Q = v to denote the event Y (li) = vi for each i. Then, define\n\nI(Q) = \u2211_{v \u2208 {0,1}^k} Pr(Q = v) \u2211_{n \u2208 L} f (Pr(Y (n) | Q = v))    (2)\n\nThe most basic version of our problem is the following.\n\nProblem 1 (Best-1). Given a G \u2208 DAG, find the node q that minimizes the expected uncertainty I({q}).\n\nA more challenging question is the following:\n\nProblem 2 (Best-k). Given a G \u2208 DAG, find the set of nodes Q of size k that minimizes I(Q).\n\nIn addition to this, we also consider the incremental version of the problem, defined as follows. Suppose we have already issued a set of queries Q0 and obtained a vector v0 of their correctness values. Given a new set of queries, we define the conditioned uncertainty as I(Q | Q0 = v0) = \u2211_v Pr(Q = v | Q0 = v0) \u2211_{n \u2208 L} f (Pr(Y (n) | Q = v \u2227 Q0 = v0)). We also pose the following question:\n\nProblem 3 (Incr). Given a G \u2208 DAG, and a set of already issued queries Q0 with answer v0, find the best node q to query next that minimizes I({q} | Q0 = v0).\n\nIn this work, we use the uncertainty metric given by\n\nf (p) = p(1 \u2212 p)    (3)\n\nThus, f (p) is minimized when p is either 0 or 1, and is maximum at p = 1/2. Note that f (p) = 1/4 \u2212 (1/2 \u2212 p)^2. Hence, minimizing f (p) is equivalent to maximizing the squared difference of each probability from 1/2. We call this the L2 metric. 
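To make Eqs. (1)–(3) concrete, the sketch below evaluates I({q}) for the tree of Example 1.1 by brute-force enumeration of the X variables (exponential, so only for tiny graphs; all names are ours, not the authors'):

```python
from itertools import product

def f(p):                                  # the L2 metric of Eq. (3)
    return p * (1 - p)

# Example 1.1 as an unlabeled tree: p[n] is the operator confidence at n.
p = {"w1": 0.9, "w2": 0.8, "e1": 0.8, "e2": 0.5, "e3": 1.0, "e4": 1.0, "e5": 0.6}
parent = {"e1": "w1", "e2": "w1", "e3": "w2", "e4": "w2", "e5": "w2"}
nodes = ["w1", "w2", "e1", "e2", "e3", "e4", "e5"]   # topological order
leaves = ["e1", "e2", "e3", "e4", "e5"]

def worlds():
    """Yield (probability, Y) for every assignment of the X variables."""
    for bits in product([0, 1], repeat=len(nodes)):
        x = dict(zip(nodes, bits))
        w, y = 1.0, {}
        for n in nodes:
            w *= p[n] if x[n] else 1 - p[n]
            y[n] = x[n] and (y[parent[n]] if n in parent else 1)
        yield w, y

def I(q):
    """Expected output uncertainty I({q}) of Eq. (2) for a single query q."""
    total = 0.0
    for v in (0, 1):
        pv = sum(w for w, y in worlds() if y[q] == v)    # Pr(Q = v)
        if pv == 0:
            continue
        for n in leaves:
            pn = sum(w for w, y in worlds() if y[q] == v and y[n]) / pv
            total += pv * f(pn)
    return total

best = min(leaves, key=I)   # e3 (tied with e4), as argued in the introduction
```

Running this confirms the intuition of Example 1.1: querying e3 (or e4) yields a lower expected uncertainty than querying the individually most uncertain tuple e5.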
There are other reasonable choices for the uncertainty metric, e.g., L1 or entropy. The actual choice of uncertainty metric is not important for our application. In the technical report [11], we show that using any of these different metrics, the resulting solutions are \u201csimilar\u201d to each other.\nOur uncertainty objective function can be shown to satisfy some desirable properties, such as:\nTheorem 2.1 (Information Never Hurts). For any sets of queries Q1, Q2, I(Q1) \u2265 I(Q1 \u222a Q2).\nThus, expected uncertainty cannot increase with more queries. Further, the objective function I is neither sub-modular nor super-modular. These results continue to hold when f is replaced with other metrics (Sec. 6). Lastly, for the rest of the paper, we will assume that the query nodes Q are selected from only among the leaves of G. This is only to simplify the presentation. There is a simple reduction of the general problem to this problem, where we attach a new leaf node to every internal node, and set their probabilities to 1. Thus, for any internal node, we can equivalently query the corresponding leaf node (we will need to use the weighted form of Eq. (1), described in the extended technical report [11], to ensure that new leaf nodes have weight 0 in the objective function.)\n\n3 Summary of main results\nWe first define classes of probabilistic and-or dags. Let DAG(\u2227) and DAG(\u2228) denote the subclasses of DAG where all the node labels are \u2227 and \u2228 respectively. Let DAG(2,\u2227) and DAG(2,\u2228) denote the subclasses where the dags are further restricted to depth 2. (We define the depth to be the number of nodes in the longest root to leaf directed path in the dag.) Similarly, we define the class TREE where the dag is restricted to a tree, and TREE(d), consisting of depth-d trees. 
For trees, since each node has a single parent, the labels of the nodes do not matter.\nWe start by defining relationships between the expressibility of each of these classes. Given any D1, D2 \u2208 DAG, we say that D1 \u2261 D2 if they have the same number of leaves and define the same joint probability distribution on the set of their leaves. Given two classes of dags C1 and C2, we say C1 \u2282 C2 if for all D1 \u2208 C1, there is a D2 \u2208 C2 s.t. the size of D2 is polynomial in the size of D1 and D1 \u2261 D2.\nTheorem 3.1. The following relationships exist between the different classes:\n\nTREE(2) \u2282 TREE \u2282 DAG(2,\u2227) = DAG(\u2227) \u2282 DAG(2,\u2228) \u2282 DAG(\u2228) \u2282 DAG\n\nTable 1 shows the complexity of the three problems defined in the previous section, for different classes of graphs. The parameter n is the number of nodes in the graph. While the problems are tractable, and in fact efficient, for trees, they become hard for general dags. Here, PP denotes the complexity class of probabilistic polynomial time algorithms. Unless P = NP, there are no PTIME algorithms for PP-hard problems. Further, for some of the problems, we can show that they cannot be approximated within a factor of 2^{n^{1\u2212\u03b5}} for any positive constant \u03b5 in PTIME.\n\n4 Best-1 Problem\nWe start with the most basic problem: given a probabilistic DAG G, find the node to query that minimizes the resulting uncertainty. We first provide PTIME algorithms for TREE(2), TREE, DAG(\u2227), and DAG(2,\u2228). (Recall that, as we saw earlier, DAG(2,\u2228) subsumes DAG(\u2227).) Subsequently, we show that finding the best node to query is intractable for DAG(\u2228) of depth greater than 2, and is thus intractable for DAG as well. For TREE and DAG(\u2227), the expression for Y (n) can be rewritten as Y (n) = \u22c0_{m \u2208 anc(n)} X(m), where anc(n) denotes the set of ancestors of n, i.e., those nodes that have a directed path to n, including n itself. This \u201cunrolled\u201d formulation will allow us to compute the probabilities Pr(Y (n) = 1) easily.\n\n4.1 TREE(2)\nConsider a simple tree graph G with root r, having p(r) = pr, and having children l1, \u00b7\u00b7\u00b7, ln with p(li) = pi. Given a node x, let ex denote the event Y (x) = 1, and \u00acex the event Y (x) = 0. We want to find the leaf q that minimizes I({q}), where:\n\nI({q}) = \u2211_{l \u2208 L} [Pr(eq) f (Pr(el | eq)) + Pr(\u00aceq) f (Pr(el | \u00aceq))]    (4)\n\nBy a slight abuse of notation, we will use I(q) to denote the quantity I({q}). It is easy to see the following (let l \u2260 q):\n\nPr(eq) = pr pq,    Pr(el | eq) = pl,    Pr(el | \u00aceq) = pr pl(1\u2212 pq)/(1\u2212 pr pq)\n\nSubstituting these expressions back in Eq. (4), and assuming f (p) = p(1\u2212 p), we get the following:\n\nI(q) = \u2211_{l \u2208 L, l \u2260 q} [pr pq pl(1\u2212 pl) + pr pl(1\u2212 pq)(1\u2212 pr pl(1\u2212 pq)/(1\u2212 pr pq))]\n\nWe observe that it is of the form\n\nF0(pq, pr) + F1(pq, pr) \u2211_l pl + F2(pq, pr) \u2211_l pl^2    (5)\n\nwhere F0, F1, F2 are small rational polynomials over pr and pq. This immediately gives us a linear time algorithm to pick the best q. We first compute \u2211_l pl and \u2211_l pl^2, and then compute the objective function for all q in linear time.\nNow we consider the case when G is any general tree with the set of leaves L. Recall that ex is the event that Y (x) = 1. Denote the probability Pr(ex) by Px. Thus, Px is the product of p(y) over all nodes y that are the ancestors of x (including x itself). Given nodes x and y, let lca(x,y) denote the least common ancestor of x and y. Our objective is to find q \u2208 L that minimizes Eq. (4). The following is immediate:\n\nPr(eq) = Pq,    Pr(el | eq) = Pl / P_{lca(l,q)},    Pr(el | \u00aceq) = Pl (1\u2212 Pq/P_{lca(l,q)}) / (1\u2212 Pq)\n\nHowever, if we directly plug this in Eq. (4), we don\u2019t get a simple form analogous to Eq. (5). Instead, we group all the leaves into equivalence classes based on their lowest common ancestor with q, as shown in Fig. 2.\nLet a1, \u00b7\u00b7\u00b7, ad be the set of ancestors of q (taking a0 = q). Consider all leaves in the set Li such that their lowest common ancestor with q is ai. Given a node x, let S(x) denote the sum of Pl^2 over all leaves l reachable from x. If we sum Eq. (4) over all leaves in Li, we get the following expression:\n\n\u2212(S(ai)\u2212 S(ai\u22121)) (Pq + P_{ai}^2 \u2212 2 Pq P_{ai}) / (P_{ai}^2 (1\u2212 Pq)) + \u2211_{l \u2208 Li} Pl\n\nDefine \u22061(ai) = S(ai)\u2212 S(ai\u22121) and \u22062(ai) = (S(ai)\u2212 S(ai\u22121)) (1\u2212 2P_{ai}) / P_{ai}^2. We can write the above expression as:\n\n\u2212 (1/(1\u2212 Pq)) \u22061(ai) \u2212 (Pq/(1\u2212 Pq)) \u22062(ai) + \u2211_{l \u2208 Li} Pl\n\nSumming these terms over all the ancestors of q, we get\n\nI(q) = \u2212 (1/(1\u2212 Pq)) \u2211_{a \u2208 anc(q)} \u22061(a) \u2212 (Pq/(1\u2212 Pq)) \u2211_{a \u2208 anc(q)} \u22062(a) + \u2211_{l \u2208 L} Pl\n\nFigure 2: Equivalence Classes of Leaves\n\n4.2 TREE\nOur main observation is that we can compute I(q) for all leaves together in time linear in the size of G. First, using a single top-down dynamic programming pass over the tree, we can compute Px for all nodes x. Next, using a single bottom-up dynamic programming pass over G, we can compute S(x) for all nodes x. In the third step, we compute \u22061(x) and \u22062(x) for all nodes in the tree. In the fourth step, we compute \u2211_{a \u2208 anc(x)} \u2206i(a) for all nodes in the graph using another top-down dynamic programming pass. Finally, we scan all the leaves and compute the objective function using the above expression. 
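As a sanity check on Eq. (4) and its Eq. (5)-style decomposition for the depth-2 case, the sketch below computes I({q}) both directly and via the two precomputed sums (function names and the explicit algebraic rearrangement are ours):

```python
def f(p):                                   # L2 uncertainty metric
    return p * (1 - p)

def I_direct(pr, ps, q):
    """I({q}) for a depth-2 tree straight from Eq. (4); pr is the root
    probability and ps the list of leaf probabilities."""
    pq, total = ps[q], 0.0
    for l, pl in enumerate(ps):
        if l == q:
            continue                        # the queried leaf contributes 0
        p_eq = pr * pq                      # Pr(Y(q) = 1)
        total += p_eq * f(pl) + (1 - p_eq) * f(pr * pl * (1 - pq) / (1 - pr * pq))
    return total

def I_sums(pr, ps, q):
    """The same value using only sum(pl) and sum(pl^2) over the other
    leaves, so that after computing the two sums once, each candidate q
    costs O(1) -- giving the O(n) Best-1 algorithm for TREE(2)."""
    pq = ps[q]
    s1 = sum(ps) - pq
    s2 = sum(pl * pl for pl in ps) - pq * pq
    coef = pr * pq + pr * pr * (1 - pq) ** 2 / (1 - pr * pq)
    return pr * s1 - coef * s2
```

For instance, with pr = 0.9 and leaf probabilities (0.8, 0.5), both routes give I({l1}) ≈ 0.2411.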
Each of the 5 steps runs in time linear in the size of the graph. Thus, we have\nTheorem 4.1. Given a tree G with n nodes, we can compute the node q that minimizes I(q) in time O(n).\n\n4.3 DAG(2,\u2228)\nWe now consider DAG(2,\u2228). As before, we want to find the best node q that minimizes I(q) as given by Eq. (4). However, the expressions for the probabilities Pr(eq) and Pr(el | eq) are more complex for DAG(2,\u2228). First, note that Pl, i.e., the probability Pr(Y (l) = 1), is computed as: Pl = p(l) \u00d7 (1\u2212 \u220f_{x \u2208 parent(l)} (1\u2212 p(x))). The probability that at least one of the shared ancestors of l and q is true is: Pl,q = 1\u2212 \u220f_{x \u2208 parent(l) \u2229 parent(q)} (1\u2212 p(x)). And the probability that one of the unique ancestors of l is true is: Pl\\q = 1\u2212 \u220f_{x \u2208 parent(l) \\ parent(q)} (1\u2212 p(x)). Then, the following are immediate:\n\nPr(eq) = Pq\n\nPr(eq | el) = p(l) \u00b7 p(q) \u00b7 (Pl,q + (1\u2212 Pl,q) \u00b7 Pl\\q \u00b7 Pq\\l) / Pl\n\nPr(eq | \u00acel) = (Pq \u00b7 (1\u2212 p(l)) + p(l) \u00b7 p(q) \u00b7 (1\u2212 Pl,q) \u00b7 (1\u2212 Pl\\q) \u00b7 Pq\\l) / (1\u2212 Pl)\n\nNote that Pl, Pl,q, Pl\\q can be computed for one l, q pair in time O(n), and thus for all l, q in time O(n^3). Subsequently, finding the best candidate node would require O(n^2) time, giving us an overall O(n^3) algorithm to find the best node.\nTheorem 4.2. Given G \u2208 DAG(2,\u2228) with n nodes, we can compute the node q that minimizes I(q) in time O(n^3).\nSince every dag in DAG(\u2227) can be converted into one in DAG(2,\u2228) in O(n^3) time (see [11]), we get:\nTheorem 4.3. Given G \u2208 DAG(\u2227) with n nodes, we can compute the node q that minimizes I(q) in time O(n^3).\n\n4.4 DAG(\u2228)\nTheorem 4.4 (Hardness of Best-1 for DAG(\u2228)). The best-1 problem for DAG(\u2228) is PP-Hard.\nWe use a reduction from the decision version of the #P-Hard monotone-partitioned-2-DNF problem [25]. 
The proof can be found in the extended technical report [11]. Thus, the incremental and best-k problems for DAG(\u2228) are PP-Hard as well. As a corollary of Theorem 3.1, we have:\nTheorem 4.5 (Hardness of Best-1 for DAG). The best-1 problem for DAG is PP-Hard.\nThis result immediately shows us that the incremental and best-k problems for DAG are PP-Hard. However, we can actually prove a stronger result for DAG, i.e., that they are hard to approximate. We use a weakly parsimonious reduction from the #P-Hard monotone-CNF problem. Note that unlike the partitioned-2-DNF problem (used for the reduction above), which admits an FPRAS (Fully Polynomial Randomized Approximation Scheme) [18], monotone-CNF is known to be hard to approximate [26]. In our proof, we use the fact that repeated applications of an approximation algorithm for best-1 for DAG would lead to an approximation algorithm for monotone-CNF, which is known to be hard to approximate. This result is shown in the extended version [11].\nTheorem 4.6 (Inapproximability for DAG). The best-1 problem for DAG is hard to approximate.\n\n5 Incremental Node Selection\nIn this section, we consider the problem of picking the next best node to query after a set of nodes Q0 have already been queried. We let the vector v0 reflect their correctness values. We next pick a leaf node q that minimizes I({q} | Q0 = v0). Again, by slightly abusing notation, we will write the expression simply as I(q | Q0 = v0).\nWe first consider TREE(2) and TREE. Recall from the previous section that the incremental problem is intractable for DAG(\u2228). Here, we prove that incremental picking is intractable for DAG(\u2227) itself.\n\n5.1 TREE\nWe want to extend our analysis of Sec. 4 by replacing Pr(ex) by Pr(ex | Q0 = v0) and Pr(ex | ey) by Pr(ex | ey \u2227 Q0 = v0). We will show that, conditioned on Q0 = v0, the resulting probability distribution of the leaves can again be represented using a tree. 
The new tree is constructed as follows.\nGiven Q0 = v0, apply a sequence of transformations to G \u2208 TREE, one for each q0 \u2208 Q0. Suppose the value of q0 = 1. Then, for each ancestor a of q0, including itself, set p(a) = 1. If q0 = 0, then for each ancestor a, including itself, change its p(a) to p(a) (1\u2212 P_{q0}/Pa) / (1\u2212 P_{q0}). Let all other probabilities remain the same.\nTheorem 5.1. Let G\u2032 be the tree as defined above. Then, I(q | Q0 = v0) on G is equal to I(q) on G\u2032.\nThus, after each query, we can incorporate the new evidence by updating the probabilities of all the nodes along the path from the query node to the root, and the next best node to query can still be computed in linear time.\n\n5.2 TREE(2)\nFor G \u2208 TREE(2), the above algorithm results in the following tree transformation. If a leaf q is queried, and the result is 1, then p(r) and p(q) are set to 1. If the result is 0, p(q) is set to 0 and p(r) is set to pr(1\u2212 pq)/(1\u2212 pr pq).\nInstead of using Eq. (5) to compute the next best node in linear time, we can devise a more efficient scheme. Suppose we are given all the leaf probabilities in sorted order (or we sort them initially). Then, we can subsequently compute the leaf q that minimizes Eq. (5) in O(log n) time: Consider the rational polynomials F0, F1 and F2. For a fixed pr, \u2211_l pl, and \u2211_l pl^2, the expression can be treated as a rational polynomial in the single variable pq. If we take the derivative, the numerator is a quartic in pq. Thus, it can have at most four roots. We can find the roots of a quartic using Ferrari\u2019s approach in constant time [16]. Using 4 binary searches, we can find the two pq closest to each of these roots (giving us 8 candidates for pq, plus two more which are the smallest and the largest pq), and evaluate I(q) for each of those 10 candidates. 
Thus, finding the best q takes O(log n) time.\nNow, given each new piece of evidence (i.e., the answer to each subsequent query), we can update the probability pr and the sum \u2211_l pl^2 in constant time. Given the new polynomial, we can find the new set of roots, and using the same technique as above, find the next best q in O(log n) time.\nTheorem 5.2. If the p values of the leaf nodes are provided in sorted order, then, for a depth-2 tree, the next best node to query can be computed in O(log n) time.\n\n5.3 DAG(\u2227)\nFor DAG(\u2227), while we can pick the best-1 node in O(n^3) time, we have the surprising result that the problem of picking subsequent nodes becomes intractable. The intuition is that, unlike for trees, after conditioning on a query node, the resulting distribution can no longer be represented using another dag. In particular, we show that given a set S of queried nodes, the problem of finding the next best node is intractable in the size of S. We use a reduction from the monotone-2-CNF problem.\nTheorem 5.3 (PP-Hardness of Incr. for DAG(\u2227)). The incremental problem in DAG(\u2227) is PP-Hard.\nOur reduction, shown in the extended technical report [11], is a weakly parsimonious reduction involving monotone-2-CNF, which is known to be hard to approximate; thus we have the following result:\nTheorem 5.4 (Inapproximability for DAG(\u2227)). The incremental problem for DAG(\u2227) is hard to approximate.\nThe above result, along with Theorem 3.1, implies that the incremental problem for DAG(2,\u2228) is also PP-Hard.\n\n6 Best-K\nIn this section, we consider the problem of picking the best k nodes to minimize uncertainty. Krause et al. [19] give a log n approximation algorithm for a similar problem under the condition of super-modularity: super-modularity states that the marginal decrease in uncertainty when adding a single query node to an existing set of query nodes decreases as the set becomes larger. 
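The super-modularity condition is easy to test by brute force on a small and-or dag. A minimal sketch, using our own toy 5-node member of DAG(2,\u2227) — roots u and v with probability 1/2 each, and leaves a, b, c with probability 1, where b shares one parent with a and one with c:

```python
from itertools import product

def Ic(queried):
    """Expected uncertainty at leaf c after querying `queried`, by
    enumerating the two root variables u and v (all leaf probs are 1,
    so Y(a) = Y(u), Y(b) = Y(u) AND Y(v), Y(c) = Y(v))."""
    worlds = [(0.25, {"a": u, "b": u & v, "c": v})
              for u, v in product([0, 1], repeat=2)]
    total = 0.0
    for obs in product([0, 1], repeat=len(queried)):
        consistent = [(w, y) for w, y in worlds
                      if all(y[q] == o for q, o in zip(queried, obs))]
        pv = sum(w for w, _ in consistent)
        if pv == 0:
            continue
        pc = sum(w for w, y in consistent if y["c"]) / pv
        total += pv * pc * (1 - pc)          # f(p) = p(1 - p)
    return total

# Querying a alone tells us nothing about c, but once b is known,
# a becomes informative -- the marginal gains are not diminishing:
gain_after_b = Ic(("b",)) - Ic(("b", "a"))
gain_after_empty = Ic(()) - Ic(("a",))
assert gain_after_b > gain_after_empty       # super-modularity violated
```

Here Ic(("b", "a")) − Ic(("b",)) is negative while Ic(("a",)) − Ic(()) is zero, so the super-modularity inequality fails on this graph.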
Here, we show that the super-modularity property does not hold in our setting, even for the simplest case of TREE. In fact, for DAG(2,\u2227), the problem is hard to approximate within a factor of 2^{n^{1\u2212\u03b5}} for any \u03b5 > 0. We show that TREE(2) admits a weakly-polynomial exact algorithm and a polynomial approximation algorithm. For general trees, we leave the complexity open.\n\nPicking Nodes Greedily: First, we show that picking greedily can be arbitrarily bad. Consider a tree with root having p(r) = 1/2. There are 2n leaves, half with p = 1 and the rest with p = 1/2. If we pick any leaf node with p = 1, the expected uncertainty is n/8. If we pick a node with p = 1/2, the expected uncertainty is (2n\u2212 1)/6. Thus, if we sort nodes by their expected uncertainty, all the p = 1 nodes appear before all the p = 1/2 nodes. Consider the problem of picking the best n nodes. If we pick greedily based on their expected uncertainty, we pick all the p = 1 nodes. However, all of them are perfectly correlated. Thus, the expected uncertainty after querying all p = 1 nodes is still n/8. On the other hand, if we pick a single p = 1 node, and n\u2212 1 nodes with p = 1/2, the resulting uncertainty is a constant. Thus, picking nodes greedily can be O(n) worse than the optimal.\nCounter-example for super-modularity: Next, we show an example of a graph in DAG(2,\u2227) where super-modularity does not hold. Consider a G \u2208 DAG(2,\u2227) having two nodes u and v on the top layer and three nodes a, b, and c in the bottom layer. Labels of all nodes are \u2227. Node u has an edge to a and b, while v has an edge to b and c. Let Pr(u) = 1/2, Pr(v) = 1/2, and Pr(a) = Pr(b) = Pr(c) = 1. Now consider the expected uncertainty Ic at node c. 
Super-modularity would imply that Ic({b,a})\u2212 Ic({b}) \u2265 Ic({a})\u2212 Ic({}) (since the marginal decrease in the expected uncertainty of c on picking an additional node a should be smaller for the set {b} than for {}). We show that this is violated. First, note that Pr(Y (c) | Y (a)) is the same as Pr(Y (c)) (since Y (a) does not affect Y (v) and Y (c)). Thus, the expected uncertainty at c is unaffected by conditioning on a alone, and thus Ic({a}) = Ic({}). On the other hand, if Y (b) = 0 and Y (a) = 1, then Y (c) = 0 (since Y (a) = 1 implies Y (u) = 1, which together with Y (b) = 0 implies Y (v) = 0 and Y (c) = 0). This can be used to show that, conditioned on Y (b), the expected uncertainty in c drops when further conditioning on Y (a). Thus, the term Ic({b,a})\u2212 Ic({b}) is negative, while we showed that Ic({a})\u2212 Ic({}) is 0. This violates the super-modularity condition.\nThe above example actually shows that super-modularity is violated on DAG(\u2227) for any choice of metric f in computing the expected uncertainty I, as long as f is monotonically decreasing away from 1/2. When f (p) = p(1\u2212 p), we can show that super-modularity is violated even for trees, as stated in the proposition below.\nProposition 6.1. Let f (p) = p(1\u2212 p) be the metric used in computing the expected uncertainty I. Then there exists a tree T \u2208 TREE(d) such that for leaf nodes a, b, and c in T the following holds: Ic({b,a})\u2212 Ic({b}) < Ic({a})\u2212 Ic({}).\n\n6.1 TREE(2)\nWe now consider the Best-k problem for TREE(2). As in Section 4, assume the root r with p(r) = pr, while the leaves L = {l1, . . . , ln} have p(li) = pi. Let B = \u2211_{l \u2208 L} p^2(l). Given a set Q \u2286 L, define\n\nP(Q) = \u220f_{l \u2208 Q} p(l),    S1(Q) = \u2211_{l \u2208 Q} p(l)(1\u2212 p(l)),    S2(Q) = \u2211_{l \u2208 Q} p^2(l)\n\nLemma 6.2. The best set Q of size k is one that minimizes:\n\nI\u2032(Q) = \u2212S1(Q) + ((1\u2212 pr) / ((1\u2212 pr)/P(Q) + pr)) (B \u2212 S2(Q))\n\n(The details of this computation are shown in the extended technical report.) It is easy to check that the first term is minimized when Q consists of nodes with p(l) closest to 1/2, and the second term is minimized with nodes with p(l) closest to 1. Intuitively, the first term prefers nodes that are as uncertain as possible, while the second term prefers nodes that reveal as much about the root as possible. This immediately gives us a 2-approximation in the number of queries: by picking at most 2k nodes, the k closest to 1/2 and the k closest to 1, we can do at least as well as the optimal solution for best-k.\nExact weakly-polynomial time algorithm: Note also that as k increases, P(Q) \u2192 0, and the second term vanishes. This also makes intuitive sense, since the second term prefers nodes that reveal more about the root, and once we use sufficiently many nodes to infer the correctness of the root, we do not get any gain from asking additional questions. Thus, we set a constant c\u03c4, depending on the pi, such that if k < c\u03c4, we consider all possible choices of k queries, and if k \u2265 c\u03c4, we may simply pick the k largest pi, because the second term would be very small. We describe this algorithm along with the proof in the extended technical report [11].\n\n6.2 DAG(\u2227)\nTheorem 6.3 (PP-Hardness of Best-k for DAG(\u2227)). The best-k problem in DAG(\u2227) is PP-Hard.\nThe proof can be found in the extended technical report [11]. Our reduction is a weakly parsimonious reduction involving monotone-partitioned-2-CNF, which is known to be hard to approximate; thus we have the following result:\nTheorem 6.4 (Inapproximability for DAG(\u2227)). The best-k problem for DAG(\u2227) is hard to approximate.\n\n7 Conclusion\nIn this work, we performed a detailed complexity analysis for the problem of finding the optimal set of query nodes for various classes of graphs. 
We showed that for trees, most of the problems are tractable, and in fact admit quite efficient algorithms. For general dags, they become hard even to approximate. We leave open the complexity of the best-k problem for trees.

References
[1] Ashraf M. Abdelbar and Sandra M. Hedetniemi. Approximating MAPs for belief networks is NP-hard and other theorems. Artif. Intell., 102(1):21–38, June 1998.
[2] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.
[3] Kedar Bellare, Suresh Iyengar, Aditya Parameswaran, and Vibhor Rastogi. Active sampling for entity matching. In KDD, 2012.
[4] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In ICML, page 7, 2009.
[5] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In NIPS, pages 199–207, 2010.
[6] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st edition, 2007.
[7] Philip Bohannon, Srujana Merugu, Cong Yu, Vipul Agarwal, Pedro DeRose, Arun Iyer, Ankur Jain, Vinay Kakade, Mridul Muralidharan, Raghu Ramakrishnan, and Warren Shen. Purple SOX extraction management system. SIGMOD Rec., 37:21–27, March 2009.
[8] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell., 42(2-3):393–405, 1990.
[9] Nilesh Dalvi, Ravi Kumar, Bo Pang, Raghu Ramakrishnan, Andrew Tomkins, Philip Bohannon, Sathiya Keerthi, and Srujana Merugu. A web of concepts (keynote). In PODS, Providence, Rhode Island, USA, June 2009.
[10] Nilesh Dalvi, Ravi Kumar, and Mohamed A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219–230, 2011.
[11] Nilesh Dalvi, Aditya Parameswaran, and Vibhor Rastogi. Minimizing uncertainty in pipelines. Technical report, Stanford Infolab, 2012.
[12] Sanjoy Dasgupta and John Langford. Tutorial summary: Active learning. In ICML, page 178, 2009.
[13] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[14] Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, and Ashwin Tengli. Exploiting content redundancy for web information extraction. PVLDB, 3(1):578–587, 2010.
[15] Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 353–360, 2007.
[16] Don Herbison-Evans. Solving quartics and cubics for graphics. 1994.
[17] Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI, pages 392–399, 2011.
[18] Richard M. Karp and Michael Luby. Monte-Carlo algorithms for enumeration and reliability problems. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science, pages 56–64, 1983.
[19] Andreas Krause and Carlos Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI, pages 324–331, 2005.
[20] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In AAAI, pages 1650–1654, 2007.
[21] J. Kwisthout. The Computational Complexity of Probabilistic Inference. Technical Report ICIS–R11003, Radboud University Nijmegen, April 2011.
[22] Michael L. Littman, Stephen M. Majercik, and Toniann Pitassi. Stochastic Boolean satisfiability. J. Autom. Reasoning, 27(3):251–296, 2001.
[23] A. Parameswaran, A. Das Sarma, H. Garcia-Molina, N. Polyzotis, and J. Widom. Human-assisted graph search: it's okay to ask questions. In VLDB, 2011.
[24] James D. Park and Adnan Darwiche. Complexity results and approximation strategies for MAP explanations. J. Artif. Intell. Res. (JAIR), 21:101–133, 2004.
[25] J. Scott Provan and Michael O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4):777–788, 1983.
[26] Dan Roth. On the hardness of approximate reasoning. Artif. Intell., 82(1-2):273–302, 1996.
[27] Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[28] Alice X. Zheng, Irina Rish, and Alina Beygelzimer. Efficient test selection in active diagnosis via entropy approximation. In UAI, pages 675–, 2005.